Today I came across an arXiv posting from yesterday:

Theoretical bounds on mate-pair information for accurate genome assembly by Henry Lin and Yufeng Shen (arxiv.org/abs/1310.1653, October 7 2013).

It purports to provide a sufficient condition for genome assembly, when in fact it is a rediscovery of Ukkonen’s condition for assembly (and a poor rediscovery at that). I’ve seen this happen before so I thought I’d make a short post about this.

Ukkonen’s paper is

E. Ukkonen, Approximate string matching with q-grams and maximal matches, Theoretical Computer Science 92 (1992), no. 1, 191–211.

The connection to genome assembly is explained well in a recent paper by Guy Bresler, Ma’ayan Bresler and David Tse: Optimal assembly for high-throughput shotgun sequencing (arxiv.org/abs/1301.0068, 18 February 2013).

The following figure from the Bresler et al. paper illustrates Ukkonen’s condition:

Ukkonen

The condition is that the read length L must be at least one greater than the maximum length of any interleaved repeat or triple repeat. In the figure above (Figure 4 in the Bresler et al. paper), it is impossible to distinguish the two different assemblies (that merely interchange the sequence between the interleaved repeats) with reads of length L. There is a similar example of switching for triple repeats.

The lower bound of Lin and Shen essentially recapitulates the above condition. Moreover, the upper bound is trivial: it merely describes a sufficient library design that if sequenced completely would allow for assembly. Bresler et al. go much farther than this, and describe an approach to optimal assembly under a random model of (shotgun) sequencing that in particular, requires shotgun generalizations of Ukkonen’s combinatorial condition. In a future post I will describe in much more detail the Bresler et al. results. For now, it suffices to say, Ukkonen should be essential (starting) reading for anyone working on assembly.

Update (October 8): the authors have graciously contacted me and acknowledged their oversight in not referring to Ukkonen’s work. They will update their manuscript and repost to the arXiv. They also pointed out that their paper discusses specifically paired-end reads, and that this requires a slightly different analysis than Ukkonen’s initial work (a good point that I should have mentioned in the original post). I applaud them for (a) posting the initial result on the arXiv thereby allowing the community to look at their work prior to publication and (b) for responding immediately to improve it given the feedback.