Today I came across an arXiv posting from yesterday:

Theoretical bounds on mate-pair information for accurate genome assembly by Henry Lin and Yufeng Shen (arxiv.org/abs/1310.1653, October 7 2013).

It purports to provide a sufficient condition for genome assembly, when in fact it is a rediscovery of Ukkonen’s condition for assembly (and a poor rediscovery at that). I’ve seen this happen before so I thought I’d make a short post about this.

Ukkonen’s paper is

E. Ukkonen, Approximate string matching with q-grams and maximal matches, Theoretical Computer Science 92 (1992), no. 1, 191–211.

The connection to genome assembly is explained well in a recent paper by Guy Bresler, Ma’ayan Bresler and David Tse: Optimal assembly for high-throughput shotgun sequencing (arxiv.org/abs/1301.0068, 18 February 2013).

The following figure from the Bresler *et al.* paper illustrates Ukkonen’s condition:

The condition is that the read length *L* must be at least one greater than the maximum length of any interleaved repeat or triple repeat. In the figure above (Figure 4 in the Bresler *et al.* paper), it is impossible to distinguish the two different assemblies (that merely interchange the sequence between the interleaved repeats) with reads of length *L. *There is a similar example of switching for triple repeats.

The lower bound of Lin and Shen essentially recapitulates the above condition. Moreover, the upper bound is trivial: it merely describes a sufficient library design that if sequenced completely would allow for assembly. Bresler *et al. *go much farther than this, and describe an approach to optimal assembly under a random model of (shotgun) sequencing that in particular, requires shotgun generalizations of Ukkonen’s combinatorial condition. In a future post I will describe in much more detail the Bresler *et al.* results. For now, it suffices to say, Ukkonen should be essential (starting) reading for anyone working on assembly.

**Update** **(October 8**): the authors have graciously contacted me and acknowledged their oversight in not referring to Ukkonen’s work. They will update their manuscript and repost to the arXiv. They also pointed out that their paper discusses specifically paired-end reads, and that this requires a slightly different analysis than Ukkonen’s initial work (a good point that I should have mentioned in the original post). I applaud them for (a) posting the initial result on the arXiv thereby allowing the community to look at their work prior to publication and (b) for responding immediately to improve it given the feedback.

## 2 comments

Comments feed for this article

October 8, 2013 at 6:21 pm

Henry LinHi Lior, thanks for taking the time to read the result. I have updated the draft with the references added and the change should be reflected on arxiv within a couple days. My work is distinct from the previous works by Ukkonen and David Tse, which only consider single-end reads, while we focus on sequencing with mate-pair information. Sequencing with mate-pairs creates subtle differences from single-end reads, which must be accounted for in the proofs. For example, sequencing mate-pair reads of length L and insert sizes of length 0, L, 2L, … , RL is not equivalent to having single end reads of length RL. The latter case with longer reads actually provides more information (assuming uniform coverage over the genome), and one can construct an example where reconstruction is possible in the latter case, but not in the former case. Also, we consider this result as a starting point. We would like to extend the results to show for example that empirically (e.g. for human genome) perhaps not as much mate-pair information is necessary, and/or consider the setting in which the mate-pair insert sizes are not exactly known. We appreciate any additional comments.

January 6, 2014 at 2:49 pm

Henry LinI have updated the paper with a new result describing conditions under which genome assembly can be guaranteed with mate-pairs of doubling insert-sizes (100, 200, 400, etc.). This last result is more practical than the previous results by allowing less mate-pair information to be used. The updated paper has been submitted for review, and I hope the submitted paper can receive an unbiased review. I still disagree with the claim that the result is a rediscovery of Ukkonen’s condition, which only holds for single-end reads, but I do recognize that initial oversight in citing Ukkonen’s condition and the work of Bresler, Bresler, and Tse, which has been corrected. As always, I do appreciate constructive feedback.