You are currently browsing the tag archive for the ‘Ukkonen’ tag.

Today I came across an arXiv posting from yesterday:

Theoretical bounds on mate-pair information for accurate genome assembly by Henry Lin and Yufeng Shen (arxiv.org/abs/1310.1653, October 7 2013).

It purports to provide a sufficient condition for genome assembly, when in fact it is a rediscovery of Ukkonen’s condition for assembly (and a poor rediscovery at that). I’ve seen this happen before so I thought I’d make a short post about this.

Ukkonen’s paper is

E. Ukkonen, Approximate string matching with q-grams and maximal matches, Theoretical Computer Science 92 (1992), no. 1, 191–211.

The connection to genome assembly is explained well in a recent paper by Guy Bresler, Ma’ayan Bresler and David Tse: Optimal assembly for high-throughput shotgun sequencing (arxiv.org/abs/1301.0068, 18 February 2013).

The following figure from the Bresler et al. paper illustrates Ukkonen’s condition:

The condition is that the read length L must be at least one greater than the maximum length of any interleaved repeat or triple repeat. In the figure above (Figure 4 in the Bresler et al. paper), it is impossible to distinguish the two different assemblies (that merely interchange the sequence between the interleaved repeats) with reads of length L. There is a similar example of switching for triple repeats.

The lower bound of Lin and Shen essentially recapitulates the above condition. Moreover, the upper bound is trivial: it merely describes a sufficient library design that if sequenced completely would allow for assembly. Bresler et al. go much farther than this, and describe an approach to optimal assembly under a random model of (shotgun) sequencing that in particular, requires shotgun generalizations of Ukkonen’s combinatorial condition. In a future post I will describe in much more detail the Bresler et al. results. For now, it suffices to say, Ukkonen should be essential (starting) reading for anyone working on assembly.

Update (October 8): the authors have graciously contacted me and acknowledged their oversight in not referring to Ukkonen’s work. They will update their manuscript and repost to the arXiv. They also pointed out that their paper discusses specifically paired-end reads, and that this requires a slightly different analysis than Ukkonen’s initial work (a good point that I should have mentioned in the original post). I applaud them for (a) posting the initial result on the arXiv thereby allowing the community to look at their work prior to publication and (b) for responding immediately to improve it given the feedback.

### Recent Comments

 Javier on I was wrong (part 3) Artem Barski on Open sourcing bioinstruments Uri David Akavia on Open sourcing bioinstruments Eduardo Da Veiga Bel… on Open sourcing bioinstruments Rick Farouni on Introduction to single-cell RN… Simone Tiberi on Fast and accurate gene differe… intrust antivirus so… on Yuval Peres Simone Tiberi on Fast and accurate gene differe… davidwlocke on The two cultures of mathematic… Artem Barski on Open sourcing bioinstruments

### Blog Stats

• 2,121,410 views