I recently read a “*brainiac*” column in the Boston Globe titled “For 40 years, computer scientists looked for a solution that doesn’t exist” that caused me to facepalm so violently I now have pain in my right knee.

The article is about the recent arXiv posting:

*Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false)* by Arturs Backurs and Piotr Indyk (first posted in December 2014 and most recently updated in April 2015).

In the preprint, which was presented at STOC, Backurs and Indyk show that computing edit distance sub-quadratically is hard in the sense that if it were possible to compute the edit distance between two strings of length *n* in time O(n^{2-δ}) for some constant δ > 0, then it would be possible to solve the satisfiability of conjunctive normal form formulas in time that would violate the *Strong Exponential Time Hypothesis* (SETH). This type of “lower bound” reduction is common practice in theoretical computer science. In this case the preprint is of interest because (a) it sheds light on the complexity of a computational problem (edit distance) that has been extensively studied by theoretical computer scientists and (b) it extends the relevance of SETH and sheds more light on what the hypothesis implies.

In the introduction to the preprint Backurs and Indyk motivate their study by writing that “Unfortunately, that [standard dynamic programming] algorithm runs in quadratic time, which is prohibitive for long sequences” and note that “considerable effort has been invested into designing faster algorithms, either by assuming that the edit distance is bounded, by considering the average case or by resorting to approximation”. They mention that the fastest known exact algorithm is faster than quadratic time, running in O(n^2/log^2 n), but note that it is “only barely so”. This is all true, and such preamble is typical for theoretical computer science, setting the stage for the result. In one single parenthetical remark intended to relate their work to applications they note that “the human genome consists of roughly 3 billions base pairs” and that exact edit distance computation is currently prohibitive for such long sequences. This is a true sentence, but also not very precise. More on that later in the post…
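The “standard dynamic programming algorithm” the authors refer to is the textbook Wagner-Fischer recurrence; a minimal Python sketch (keeping only two rows of the table, so the space is linear even though the time is quadratic):

```python
def edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer dynamic program: O(len(a) * len(b)) time, O(len(b)) space."""
    n = len(b)
    prev = list(range(n + 1))  # distances from the empty prefix of a to every prefix of b
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # delete a[i-1]
                          curr[j - 1] + 1,      # insert b[j-1]
                          prev[j - 1] + cost)   # substitute (or match)
        prev = curr
    return prev[n]

# edit_distance("kitten", "sitting") == 3
```

The two nested loops are the quadratic bottleneck that, per Backurs and Indyk, cannot be reduced to O(n^{2-δ}) time unless SETH is false.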

The parenthetical genome comment in the preprint was picked up in an MIT press release describing the results and implications of the paper. While the press release does a very good job explaining the reduction of Backurs and Indyk in an accessible way, it also picks up on the parenthetical genome comment and goes a bit further. The word “genome” appears three times, and it is stated that “a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes.” This is technically true, but two *human* genomes are very similar, so exhaustive comparison is not necessary to compute the edit distance. In fact, two human genomes are 99.9% identical, and with a strong prior upper bound on the edit distance the computation of edit distance is tractable.
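The point about prior upper bounds is concrete: if the distance is known to be at most k, only DP cells within k of the diagonal can matter, which cuts the time to O(nk) (the idea behind Ukkonen’s banded algorithm). A sketch, with function name and details of my choosing:

```python
def banded_edit_distance(a: str, b: str, k: int):
    """Edit distance assuming it is at most k; O(k * len(a)) time.

    Only DP cells with |i - j| <= k are computed; returns None if the
    true distance exceeds k."""
    m, n = len(a), len(b)
    if abs(m - n) > k:
        return None  # length difference alone forces more than k edits
    INF = k + 1
    prev = {j: j for j in range(min(n, k) + 1)}  # row i = 0, within the band
    for i in range(1, m + 1):
        curr = {}
        lo, hi = max(0, i - k), min(n, i + k)
        for j in range(lo, hi + 1):
            if j == 0:
                curr[j] = i
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev.get(j, INF) + 1,
                          curr.get(j - 1, INF) + 1,
                          prev.get(j - 1, INF) + cost)
        prev = curr
    d = prev.get(n, INF)
    return d if d <= k else None
```

For two 99.9%-identical sequences, k is tiny relative to n, so the band is narrow and the computation is entirely tractable even for genome-length strings.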

In terms of different *species*, the edit distance between complete genomes is completely irrelevant, because organisms undergo large-scale segmental duplications and rearrangements over the timescales of speciation. The MIT press release did not claim the Backurs and Indyk result was relevant for comparing genomes of species, but… the Boston Globe column states that

“By calculating the edit distance between the genomes of two organisms, biologists can estimate how far back in evolutionary time the organisms diverged from each other.”

**No**.

Actually, even for closely related genomes edit distance is not the relevant statistic in biology; what matters is the *alignment* of the genomes. This is why biological sequence analysis is based on what is called the Needleman-Wunsch algorithm and not the Wagner-Fischer algorithm mentioned in the MIT press release. Moreover, because the length of sequences for which it is relevant to compute alignments is bounded and rather small (usually a few tens or hundreds of thousands of base pairs at most), the real computational issue for biological sequence analysis is not running time but memory. The Needleman-Wunsch and Wagner-Fischer algorithms require O(n^2) *space*, and the algorithmic advance that made edit distance and alignment computations for biological sequence comparison tractable was Hirschberg’s algorithm, which requires only linear space at a cost of only a factor of 2 in time.
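Hirschberg’s trick is to compute a single row of the DP table in linear space, then split the problem at the column where the forward and backward half-costs meet. A compact sketch of the divide-and-conquer idea for unit-cost alignment (function names and base-case details are mine):

```python
def dp_last_row(a: str, b: str):
    """Last row of the edit-distance DP table for a vs b, in O(len(b)) space."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev

def hirschberg(a: str, b: str):
    """Optimal unit-cost alignment of a and b, using only linear space."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Base case: place the single character of a against a matching
        # character of b if one exists, otherwise substitute at position 0.
        i = max(b.find(a), 0)
        return "-" * i + a + "-" * (len(b) - i - 1), b
    mid = len(a) // 2
    left = dp_last_row(a[:mid], b)               # cost of a[:mid] vs each b prefix
    right = dp_last_row(a[mid:][::-1], b[::-1])  # cost of a[mid:] vs each b suffix
    # Split b at the column minimizing the total cost of the two halves.
    split = min(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    a1, b1 = hirschberg(a[:mid], b[:split])
    a2, b2 = hirschberg(a[mid:], b[split:])
    return a1 + a2, b1 + b2
```

Each level of the recursion touches the whole table once forward and once backward, which is where the factor-of-2 cost in time comes from; the memory footprint stays linear throughout.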

Thus, the parenthetical comment of Backurs and Indyk, intended to merely draw attention of the reader to the possible practical relevance of the edit distance problem, mutated into utter rubbish by the time the work was being discussed in the Boston Globe. If that was the only issue with the brainiac column I might have just hurt my hand when I facepalmed, but the following paragraph caused me the real pain:

> Because it’s computationally infeasible to compute the edit distance between two human genomes, biologists have to rely on approximations. They’d love a faster method, but Indyk and his coauthor, MIT graduate student Arturs Backurs, recently demonstrated a faster method is impossible to create. And by impossible they don’t mean “very hard” or “our technology has to improve first.” They mean something like, “by the laws of the mathematics, it can’t be done.”

We’ve already gone through point #1 but amazingly *nothing* is true in this paragraph:

- No, it is not computationally infeasible to compute the edit distance between two human genomes and no, biologists do not have to rely on approximations. But then there is also the fact that…
- Backurs and Indyk did *not*, in fact, demonstrate that a “faster method is impossible to create”. What they showed is that reducing the exponent in the quadratic algorithm would require SETH to be false. It turns out there are some respectable computer scientists that believe that SETH is indeed false. So…
- No, Backurs and Indyk, who understand mathematics quite well… most certainly do *not* believe that “by the laws of the mathematics, it can’t be done”. Actually,
- I’d bet that Backurs and Indyk don’t really believe that mathematics has laws. Axioms yes… but laws…

I find inaccurate reporting of science to be highly problematic. How can we expect the general public to believe scientists’ claims about serious issues, e.g. the prospects for climate change, when news reports are filled with hype and rubbish about less serious issues, e.g. the complexity of edit distance?

There is another issue that this example highlights. While complexity theory is extremely important and fundamental for computer science, and while a lot of the work in complexity has practical applications, there has been a consistent hyping of implications for biology where complexity theory is simply not very relevant. When I started my career it was fashionable to talk about protein folding being NP-complete. But of course millions of proteins were probably being folded in my body while I was writing this blog post. Does the fact that I am alive imply that P=NP?

Of course, exploring the complexity of simplified models for biological systems, such as the HP-model for protein folding that the above reference links to, has great value. It helps to sharpen the questions, to better specify models, and to define the limits of computational tractability for computational biology problems. But it is worth keeping in mind that in biology *n* does not go to infinity.

## 4 comments


August 15, 2015 at 10:18 am

Piotr Indyk

Dear Lior,

I read your analysis of the Boston Globe article. Here are my thoughts on this matter.

First, I agree with most of the observations about the article. There are several parts of it, including the title and the paragraph highlighted in your post, that I would have suggested corrections to if I was asked to fact-check them. Notably, the difference between unconditional and conditional impossibility results is hugely important, but this distinction is mentioned only towards the end of the article. In a similar vein, the distinction between the running time of an algorithm for a problem, the computational complexity of the problem, and the running time needed to solve the problem on a particular input has been blurred. Etc. (I should also note that I helped fact-check some other parts of the article, and I am comfortable with those.)

At the same time though, I have to say that I appreciate the difficulty that science writers face when trying to explain complex concepts to the general public. This is particularly the case when an article covers non-trivial concepts from multiple fields, in this case algorithms, computational complexity and biology. Simplification is a necessity, and the risk of oversimplification, inaccuracy or incorrectness is high. For example, as much as it would be preferable for the text to be non-ambiguous, a sentence along the lines of “Edit distance cannot be solved in truly sub-quadratic time in the worst case unless SETH is false” would likely be incomprehensible to the general public.

In addition to the above “editorial” comments, I also have a scientific one. Specifically: the problem of computing the minimum edit distance between two sequences (solved by the Wagner-Fischer algorithm) is in fact a special case of the maximum score alignment problem (solved by the Needleman-Wunsch algorithm you mention). This is because minimizing the number of insertions, deletions and substitutions is equivalent to the problem of maximizing -S - X, where X is the number of mismatches and S is the number of spaces (using the notation from your earlier post on this topic).

Therefore, our quadratic hardness result for the edit distance problem directly implies quadratic hardness for the alignment problem solved by the Needleman-Wunsch algorithm, for weights s=x=-1. The problem has since been shown to be hard for other choices of weights as well, as per the references [ABW15, BK15] in our paper.

In sum, even though an *algorithm* for the edit distance cannot be used to solve the alignment problem as defined above, a *hardness result* for the edit distance (which is what we show in the paper) directly implies the hardness of the alignment problem.
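The equivalence in this comment is easy to check numerically: with match weight 0 and s = x = -1, the maximum Needleman-Wunsch score is exactly the negated minimum edit distance. A quick sketch (helper names are mine, not from the paper or comment):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1]

def nw_score(a: str, b: str, match=0, mismatch=-1, space=-1) -> int:
    """Maximum Needleman-Wunsch alignment score for the given weights."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * space] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,      # match / mismatch
                          prev[j] + space,      # space in b
                          curr[j - 1] + space)  # space in a
        prev = curr
    return prev[-1]

# With these weights the optimal score is -(S + X), i.e. minus the edit distance.
for a, b in [("kitten", "sitting"), ("ACGT", "AGT"), ("", "abc")]:
    assert nw_score(a, b) == -edit_distance(a, b)
```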

Best,

Piotr

August 15, 2015 at 12:37 pm

Perry

The larger question is, ‘why even bother reporting technical news’? In stories reporting on fields I have knowledge in, I nearly always find serious errors of fact. Folks who are interested in news like this have better ways to read it, from trusted sources. My guess is that news organizations like the Globe want to appear ‘smart’ – hence the ‘brainiac’ byline – in order to bolster their credibility with their shrinking audience.

I work in Boston, and not a day goes by that there’s not a face-palm-inducing story in our local press.

August 15, 2015 at 6:30 pm

Jonathan Badger

> When I started my career it was fashionable to talk about protein folding being NP-complete. But of course millions of proteins were probably being folded in my body while I was writing this blog post. Does the fact that I am alive imply that P=NP?

No, because no algorithms are involved in actual biological protein folding as opposed to the in-silico sequence-to-structure computational problem. I agree with you that concerns over computational complexity aren’t always important in practice because, as you put it, “n does not go to infinity”, but that’s a different issue entirely.

February 2, 2016 at 3:44 pm

Moshe Vardi

See also, “The Moral Hazard of Complexity-Theoretic Assumptions”, at http://cacm.acm.org/magazines/2016/2/197412-the-moral-hazard-of-complexity-theoretic-assumptions/fulltext