My students run a journal club that meets once a week and last Friday the paper we discussed was M. Imakaev *et al., *Iterative correction of Hi-C data reveals hallmarks of chromosome organization, Nature Methods 9 (2012), p 999–1003. The paper, from Leonid Mirny’s group, describes an approach to analyzing Hi-C data, and is one of only a few papers that have been published focusing on methods for this *Seq assay. Present at journal club were my students Nicolas Bray, Shannon Hateley, Harold Pimentel, Atif Rahman, Lorian Schaeffer, Akshay Tambe, Faraz Tavakoli, postdocs Sreeram Kannan and Shannon McCurdy, and guest Emily Brown from Doris Bachtrog’s group. They all read the paper prior to our meeting, and the notes below are a summary of our discussion (although I have taken some more time to flesh out the math for the interested reader). For those who are impatient, the bottom line is that I think the title of the paper by Imakaev *et al.* should have had a few extra words (mine in italics) so that it would be “Iterative *proportional fitting of a log-linear model for* correction of Hi-C data reveals hallmarks of chromosome organization”.

The main (new) result in the Imakaev *et al.* paper is an iterative procedure for “correcting” Hi-C data. Following the correction, an eigendecomposition is proposed for the resulting “relative contact probability” matrix. The latter step is essentially the analysis idea from the original Hi-C paper by Erez Lieberman-Aiden *et al.* Together, these steps form a pipeline for analysis of Hi-C data that the authors term **I**terative **C**orrection and **E**igendecomposition (ICE). Although we discussed the “E” of ICE at group meeting (and I have some questions about it), I don’t get into it in this post. I focus on the iterative correction, which in the notation of the paper is as follows:

First, the genome is divided into *n* bins (or regions) from which an *nxn* matrix *O *of contact counts is obtained. In this matrix, the *ij*-th entry is the number of “double-sided” (DS) reads that mapped respectively to the *i*th and *j*th bins. The counts are obtained by mapping the reads to the bins, which are defined to be regions of size either 1-Mb or 200-kb. The paper also considers “single-sided” reads, which arises from contacts between mappable and unmappable parts of the genome, but for simplicity we don’t discuss their treatment here. A key point is that the matrix *O* has its diagonal and one-off diagonal elements zeroed out. The rationale, according to the paper, is that “within a bin and with adjacent bins, … contacts occur at scales finer than the given resolution”. In other words, because the matrix *O* is symmetric, the data consists of numbers.

Given *O*, the goal is now to find another *nxn* matrix *T* with zeroes on the diagonal and off-diagonal satisfying the *n* constraints , together with a vector of length *n* called *B*, so that the constraints

(1)

are satisfied. In other words, there are indeterminates to solve for using the same number of constraints. The intuition provided in the paper is that the vector *B* captures bias of various kinds, leaving the interaction terms *T* as the “corrected” data. More on this in a moment.

A procedure is provided for solving for *B *and *T: *Begin by setting a matrix and the vector to be 1. Then perform the iterations:

- Calculate and let .
- Set .
- Set .
- Set .

The hope is that the converge to *B* and the to *T.*

We were initially a bit perplexed by this method. First, the authors never prove convergence of the iterative algorithm. They merely state that “for many maps, iterative correction converges in around 10 iterations. For certain cases, however, further iterations are required to pull in outliers” so they use 50 iterations because it is “twice the number required for the worst case encountered”. Second, although its easy to verify that a solution of equation (1) is a fixed point of the iteration, it is not clear that there is a unique solution. Third, neither the equation (1), nor the description of the iterative procedure provide an answer to the question *why this method? *The authors provide some explanation for what the iterative algorithm might be accomplishing in the Supplement (page 21) but as I discuss below, it is incomplete. The motivation for the exact form of equation (1) is basically the empirical “validation” that it appears to do good things in practice and makes intuitive sense. Notably, the claim in the abstract that the method yields”genome-wide maps of relative contact probabilities” is true only in definition. That is to say, the authors have defined the matrix *T *by equation (1) and then called it the “map of relative contact probabilities”. Moreover, as explained above, it is completely unclear that the iterative procedure even recovers a *T* satisfying equation (1).

So what are they really doing, and what is going on?

As a warmup, it helps to think about a model that leads to a simplification of equation (1). Suppose that instead of asking for a decomposition of as , we instead adopt the following mindset: we imagine that the counts are the result of observing Poisson (or multinomially) distributed random variables where for some vector *B* whose elements sum to 1 and constant *N**. * This model is called the *model of independence *because the observed counts *O* can be viewed as being generated by repeating *N* times the process of choosing twice (independently) random indices according to the distribution *B. *It is a special case of a *log-linear *model due to the following description: a count in cell *i,j *can be modeled as having occurred with probability where , and *Z* () is a normalization factor ensuring that the probabilities *p* sum to 1.

Estimation of the maximum likelihood parameters of this model has been studied extensively in the statistics literature. For starters, if the row and column sums of *O* are positive, and *O* can not be permuted into block-diagonal shape, then there is a unique maximum to the likelihood function. In my book Algebraic Statistics for Computational Biology, coauthored with Bernd Sturmfels, we provided an introduction to log-linear models, also called toric models, with a presentation of Birch’s theorem (our Theorem 1.10) which generalizes the result above.

Assuming the MLE is unique, the following algorithm will converge to it: Set and then iterate

.

This algorithm is known as *iterative proportional fitting.** *It will produce a matrix of rank 1 with the same row and column sums as *O,* i.e. it will factor as

.

Iterative proportional fitting has a long history starting with its introduction by W. Edwards Deming and Frederick F. Stephan in 1940 in the paper “On a least squares adjustment of a sampled frequency table when the expected marginal totals are known“, published in the Annals of Mathematical Statistics. The algorithm is motivated by computing the roots of the partial derivatives of the log likelihood function. In the setting above, assuming a Poisson model, we have that

where *C* is a constant. From this we obtain that so that . In other words, at the maximum likelihood setting of the parameters, it must be the case that *the model margins equal the observed margins.* This is the observation of Imakaev *et al. *on page 21 of the supplement where the partial derivatives are computed. However this doesn’t explain how to solve for the maximum likelihood parameters. To get an algorithm, we multiply both sides of the margin identity by to obtain

,

or

.

Fixing the right hand side, while solving for the left hand sides, leads to the recursion

.

which is exactly iterative proportional fitting (IPF).

A standard extension of the independence model above is to add an interaction term, so that . This model is still log-linear. Upon exponentiation, and setting , we obtain the likelihood function that appears in the Supplement of Imaev *et al. *(page 21, equation starting *LL*), or equivalently the constraints

.

Applying the principle explained above for deriving IPF to the partial derivatives, one arrives at the iterative correction of Imakaev *et al. *that I described in Steps 1–4 above.

The understanding that Imakaev *et al.* have rediscovered IPF in the context of a specific log-linear model is useful. First, there is a large (complicated and non-trivial) literature on the convergence properties of IPF, as well as a lot of work “explaining” the algorithm. For example it has the interpretation of minimizing the KL divergence from the observed distribution to the model distribution, which leads to an interpretation of IPF as coordinate ascent in the likelihood. Furthermore, there is work on regularization which could very well be useful in the context of Hi-C data. The work on IPF also shows that the convergence claims of Imakaev *et al.* depend on assumptions about the non-zero entries and row/column sums in the observation matrix, a fact that is glossed over in the paper and that might come into play as Hi-C is applied in different settings. Sometimes ignorance is *not* bliss!

Having said that, the take home message is that the Imakaev *et al.* iterative correction makes sense, and (fortunately) there is an established theory underpinning the algorithm. For these reasons I like the paper, and I also appreciate the open source distribution of the ICE software by Mirny’s group. However I can’t help but wonder what reviewer #3 would have written if the paper had been submitted with the extended title I have proposed, an acknowledgment that underlying the iterative correction in the ICE pipeline is an application of a standard inference algorithm (IPF) to a standard log-linear model for interaction, and references to the large body of literature on IPF.

## 16 comments

Comments feed for this article

November 17, 2013 at 7:46 pm

Marcin CieslikNever heard of IPF, but the algorithm looks similar to multiplicative algorithms for non-negative matrix factorization (specifically its quadratic extension (i.e. B occurring twice in eq. 1; http://www.sciencedirect.com/science/article/pii/S0031320311004389). Is there a connection?

November 17, 2013 at 9:20 pm

Lior PachterThere is a close connection. In the simple (independence model) example I work out in the blog post, the Iterative Proportional Fitting (IPF) algorithm is exactly the same as the Lee-Seung (LS) algorithm for non-negative matrix factorization with rank 1. However IPF generalizes to inference for Markov Random Fields (or log-linear models) whereas LS generalizes to inference of parameters W,H in M=WH, which is certainly not log-linear.

November 18, 2013 at 5:12 am

Marcin CieslikThanks so much!

November 26, 2013 at 8:47 am

Noam KaplanI would like to propose a different take on this.

This procedure is actually the well-known Sinkhorn-Knopp balancing algorithm, which has been “rediscovered” multiple times and has been known to be in use as early as the 1930s. The idea is that given a Symmetric non-negative matrix A, we would like to find a diagonal matrix D with positive diagonal entries such that B=DAD is doubly stochastic, i.e. the sum of the rows and the sum of the columns of B is 1. The proposed procedure is exactly the fixed point iteration, and the process will converge (with some exclusions). Each iteration i essentially constructs its own diagonal matrix Di, and the final D is the multiplication of these matrices. A nice detailed explanation can be found here:

http://www.mathstat.strath.ac.uk/downloads/publications/skapp.pdf

What is nice is that it makes the final interpretation straightforward: The matrix D gives the “bias factor” of each row/column, so that Bi,j=Di,i*Dj,j*Ai,j

which explicitly shows what this model is assuming.

Disclosure: I am currently a member of the Dekker Lab, but was not involved in the writing of this paper (I just joined then). These views are based on my own research/interpretation of the procedure and may not represent those of the authors.

November 26, 2013 at 8:07 pm

Lior PachterHi Noam,

Thanks for pointing out the connection to the Sinkhorn-Knapp algorithm- I hadn’t thought of that. I do have two comments:

1. First, just a small correction to your post: in the ICE procedure (and using your notation) it is B that is given and A that is being estimated. This is ok, because if B=DAD then A=D^{-1}BD^{-1}. But there is an inverse required.

2. I personally prefer the IPF interpretation of what is going on, because it is much more general and furthermore provides a statistical meaning (via the idea that IPF learns parameters of Markov Random Fields). In fact, I think that Sinkhorn himself generalized the original algorithm to what was equivalent to the original IPF algorithm in his 1967 paper http://www.jstor.org/stable/2314570 and as Feinberg points out in http://www.jstor.org/stable/2239244 Sinkhorn was just rediscovering IPF. In any case, regardless of who was rediscovering who, I think that the statistical interpretation (and derivation) of IPF provides more meaning than the Sinkhorn-Knapp algorithm which comes from a numerical linear algebra perspective. Having said that, the more ways there are to look at an idea, the better one can understand it. So I very much appreciate your comment.

Lior

November 27, 2013 at 7:47 am

Noam KaplanI also liked the statistical interpretation – thanks for this interesting blog post.

Another point that might be worth mentioning, is that these types of balancing procedures often reduce the condition number of the matrix (e.g. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.1679&rep=rep1&type=pdf ), which could be a good thing (numerically) further down the line when you do things like eigendecomposition.

March 6, 2014 at 6:43 am

Martin GerberPlease correct me if I am wrong, but I claim that there is a flaw in the stated algorithm, and further that it does not equal the IPF procedure.

Let the real symmetric non-negative matrix $O$ have all row sums equal to $c$ for some positive constant $c \ne 1$. Then we have that $S_i = c$ for all $i$, and further that $\overline{S} = c$. Thus we get that $\Delta B_i^k = 1$, hence $B_i^k = 1$, and $W^k = O$, for all $k$. This implies that $W^{\infty} = O$. Since $O$ does not provide row sums equal to $1$, this trivial limit differs from any existing solution $T$.

For example, for $O=[0 2; 2 0]$ we get for all $k$ that $B^k=[1 0; 0 1]$ and $W^k = [0 2; 2 0]$, which does not provide unit row sums. However, IPF would converge to the (inverse of the) correct solution, which is $B^k = [sqrt(2) 0; 0 sqrt(2)]$ and $T = [0 1; 1 0]$, yielding $O = B * T * B$.

Probably this flaw occurs from the fact that the IPF procedure *alternately* maps row sums and column sums to their intended values, while the update rule for the $\epsilon_{ij}$ in your post does only consider the row summation part. It seems as if the HiC-authors wanted to provide a multiplicative symmetrization of the IPF procedure in order to yield unit row sums, but their specific approach fails.

April 25, 2014 at 3:07 am

Jonathan CairnsI think this comment is correct in that IPF should alternate between rows and columns.

May 12, 2014 at 1:16 pm

Maxim ImakaevIt’s great to see other people discussing our work, and to have an informal place to discuss papers. We indeed did not know about iterative proportional fitting, and we thank Lior for pointing us towards this. We also would like to reply to the comments and questions posted on this blog.

Reply to Jonathan: Hi-C matrices (whole-genome and single chromosome contact maps) are symmetric, therefore there is no need to alternate between rows and columns.

Reply to Martin: Our algorithm converges to a matrix which has a constant as a sum of any row/column (this roughly preserves the overall sum of the matrix, which is useful for certain analyses). The matrix can be then divided by this constant, and then it would be normalized to one.

May 13, 2014 at 10:25 pm

Lior PachterDear Maxim,

Many thanks for taking the time to comment on the post. I appreciate your reply. Thanks again,

Lior

May 31, 2015 at 4:36 am

JamesMiljanovicI have read so many articles on the topic of the blogger lovers however this paragraph is genuinely a good article, keep it up.

January 17, 2017 at 7:47 pm

jasontanThe reason I am reading this is to understand if the algorithm produce the “right” contact map even when it did not converge?

January 17, 2017 at 8:02 pm

Maxim ImakaevJasontan, do you have an example of when the algorithm does not converge? I haven’t encountered it on Hi-C matrices…

January 17, 2017 at 11:01 pm

jasontanI have been trying to apply ICE normalization on my mouse Hi-C data (700 million paired-end reads) digested with DpnII (4-base cutter). Mapping and fragment filtering were done with HiCUP while raw interaction matrix was computed using Homer. The observed matrix was fed into “normICE” function in HiTC package with 500 maximum iterations. I was surprised to find that it did not converge.

January 17, 2017 at 11:02 pm

jasontanTo add on, I have tried 100Kb resolution, 1Mb resolution.

January 18, 2017 at 10:51 am

Maxim ImakaevUnfortunately, I’m not familiar with the HiTC package…

An important processing to do before running ICE is to remove (or set to zero) all bins with less than some number of reads (e.g. 50), and remove all bins with less than some number of non-zero entrees (e.g. 10). If those bins are not removed, it creates “standalone” pixels in the map in the rows that are almost completely zero. Those “standalone” pixels can leave to non-convergence, and even if the algorithm converge the matrix sometimes ends up being incorrect.

Can you try feeding your matrix at 1MB or 100kb resolution to the mirnylib.numutils.completeIC method of the mirnylib package? It will do these filtering steps for you. If you send it to me I can run it for you as well.

We are also developing a cooler package that can so ICE on sparse matrices. You can try it as well.

Cooler: https://github.com/mirnylab/cooler/tree/master/cooler

Mirnylib: https://www.google.com/search?q=mirnylib&oq=mirnylib&aqs=chrome..69i57j0.1546j0j7&sourceid=chrome&ie=UTF-8