I recently completed a term as director of our Center for Computational Biology and one of the things I am proud of having accomplished is helping Brian McClendon, program administrator for the Center, launch a new Ph.D. program in computational biology at UC Berkeley.

The Ph.D. program includes a requirement for taking a new course, “Classics in Computational Biology” (CMPBIO 201), that introduces students to research projects and approaches in computational biology via a survey of classic papers. I taught the class last week and the  classic paper I chose was  “Comparison of biosequences” by Temple Smith and Michael Waterman, Advances in Applied Mathematics 2 (1981), p 482–489.  I gave a preparatory lecture this past week in which I discussed the Needleman-Wunsch algorithm, followed by leading a class discussion about the Smith-Waterman paper and related topics. This post is an approximate transcription of my lecture on the Needleman-Wunsch algorithm:

The Needleman-Wunsch algorithm was published in “A general method applicable to the search for similarities in the amino acid sequence of two proteins“, Needleman and Wunsch, Journal of Molecular Biology 48 (1970), p 443–453. As with neighbor-joining or UniFrac that I just discussed in my previous blog post, the Needleman-Wunsch algorithm was published without an explanation of its meaning, or even a precise description of the dynamic programming procedure it is based on. In their paper 11 years later, Smith and Waterman write

“In 1970 Needleman and Wunsch  introduced their homology (similarity) algorithm. From a mathematical viewpoint, their work lacks rigor and clarity. But their algorithm has become widely used by the biological community for sequence comparisons.”

They go on to state that “The statement in Needleman-Wunsch  is less general, unclearly stated, and does not have a proof” (in comparison to their own main Theorem 1). The Theorem Smith and Waterman are referring to explicitly describes  the dynamic programming recursion only implicit in Needleman-Wunsch, and then explains why it produces the “optimal” alignment. In hindsight, the Needleman-Wunsch algorithm can be described formally as follows:

Given two sequences and b of lengths $n_1$ and $n_2$ respectively, an alignment $A(a,b)$ is a sequence of pairs: $[(a_{i_1},b_{j_1}),(a_{i_2},b_{j_2}),\ldots : 1 \leq i_1 < i_2 < \cdots \leq n_1, 1 \leq j_1 < j_2 \cdots \leq n_2 ]$.

Given an alignment $A(a,b)$, we let $M = | \{ k: a_{i_k} = b_{j_k} \} |, X = | \{ k: a_{i_k} \neq b_{j_k} \} |$

and $S = | \{ r:a_r \neq i_k \, \mbox{for any k} \} | + | \{ r:b_r \neq j_k \, \mbox{for any k} \} |$.

In other words, M counts the number of matching characters in the alignment, X counts the number of mismatches and S the number of “spaces”, i.e. characters not aligned to another character in the other sequence. Since every character must be either matched, mismatched, or a not aligned, we have that $2 M + 2 X + S = n_1+n_2$.

Given scores (parameters) consisting of real numbers $m,x,s$, the Needleman-Wunsch algorithm finds an alignment that maximizes the function $m \cdot M + x \cdot X + s \cdot S$. Note that by the linear relationship between M,X and described above, there are really only two free parameters, and without loss of generality we can assume they are and x. Furthermore, only and are relevant for computing the score of the alignment; when the scores are provided statistical meaning, one can say that and X are sufficient statistics. We call them the summary of the alignment.

The specific procedure of the Needleman-Wunsch algorithm is to compute a matrix S recursively: $S(i,j) = \mbox{max} \{ S_{i-1,j} + s,S_{i,j-1} + s, S_{i-1,j-1} + xI(a_{i} = b_{j} )\}$

where $I(a_i = b_j)$ is the indicator function that is 1 if the characters are the same, and zero otherwise, and an initialization step consists of setting $S(i,0)=s \cdot i$ and $S(0,j) = s \cdot j$ for all i,j. There are numerous generalizations and improvements to the basic algorithm, both in extending the scoring function to include more parameters (although most generalizations retain the linearity assumption), and algorithms for trading off time for space improvements (e.g. divide-and-conquer can be used to improve the space requirements from $O(n_1 n_2)$ to $O(n_1 + n_2)$).

An important advance in understanding the Needleman-Wunsch algorithm was the realization that the “scores” m,x and s can be interpreted as logarithms of probabilities if the sign on the parameters is the same, or as log-odds ratios if the signs are mixed. This is discussed in detail in the book “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Sequences” by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison.

One insight from the statistical point of view is that if  max and plus in the recursion are replaced with plus and times, and the logarithms of probabilities are replaced by the probabilities themselves, then the Needleman-Wunsch recursion computes the total probability of the pairs of sequences marginalized over alignments (i.e., it is the forward algorithm for a suitably defined hidden Markov model). Together with the backward algorithm, which consists of running the forward algorithm on the reversed sequences, this provides an efficient way to compute for every pair of nucleotides the posterior probability that they are aligned. In other words, there is a meaningful interpretation and useful application for the Needleman-Wunsch algorithm with the semi-ring $(\mathbb{R}_{-}, \oplus := \mbox{max}, \otimes := +)$ replaced with $(\mathbb{R}, \oplus := +, \otimes := \times)$.

In work in collaboration with Bernd Sturmfels, we realized that it also makes sense to run the Needleman-Wunsch algorithm with the  polytope algebra $(\mathcal{P}, \oplus := \mbox{convex hull of union}, \otimes := \mbox{Minkowski sum})$. Here $\mathcal{P}$ is  the set of convex polytopes in $\mathbb{R}^d$ for some d, polytopes are “added” using the geometric operation of the convex hull of the union of polytopes, and they are “multiplied” by taking Minkowski sum. The result allows one to find not only a single optimal alignment, but all optimal alignments.  The details are discussed in a pair of papers:

L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16138–16143.

C. Dewey, P. Huggins, K, Woods, B. Sturmfels and L. Pachter, Parametric alignment of Drosophila genomes, PLoS Computational Biology, Volume 2, Number 6 (2006) p e73.

The first explains the mathematical foundations for using the result and is where we coined the term polytope propagation algorithm for the Needleman-Wunsch algorithm with the polytope algebra. The second paper illustrates an application to a whole-genome alignment of two Drosophilids and provides an alternative to polytope propagation (the beneath-beyond algorithm) that is much faster. The type of analysis performed in the paper is also called parametric alignment, a problem with a long history whose popularity was launched by Dan Gusfield who provided the first software program for parametric alignment called XPARAL.

The polytope propagation algorithm (Needleman-Wunsch with the polytope algebra) is illustrated in the figure below: The convex polygon in the bottom right panel contains in its interior points corresponding to all possible summaries for the alignments of the two sequences. The optimal summaries are vertices of the polygon. For each vertex, the parameters for which the Needleman-Wunsch algorithm will produce a summary corresponding to that vertex form a cone. The cones together form a fan which is shown inside the polygon.

The running time of polytope propagation depends on the number of vertices in the final polytope. In our papers we provide bounds showing that polytope propagation is extremely efficient. In particular, for parameters, the number of vertices is at most $O(n^{\frac{d(d-1)}{(d+1)}})$Cynthia Vinzant, formerly in our math department at UC Berkeley, has written a nice paper on lower bounds.

The fact that the Needleman-Wunsch algorithm can be run with three semi-rings means that it is particularly well suited to C++ thanks to templates that allow for the variables in the recursion to be abstracted. This idea (and code) is thanks to my former student Colin Dewey:

template<typename SemiRing>

void
alignGlobalLastRow(const string& seq1,
const string& seq2,
const typename SemiRing::Element match,
const typename SemiRing::Element mismatch,
const typename SemiRing::Element gap,
vector& row);
const Element one = SemiRing::multiplicativeIdentity;
// Initialize row
row.resize(seq2.size() + 1);
row = one;
// Calculate first row
for (size_t j = 1; j <= seq2.size(); ++j)
row[j] = gap * row[j - 1];
// Calculate remaining rows
Element up, diag;
for (size_t i = 1; i <= seq1.size(); ++i) {
diag = row;
row *= gap;
for (size_t j = 1; j <= seq2.size(); ++j) {
up = row[j];
if (seq1[i - 1] == seq2[j - 1]) {
row[j] = match * diag + gap * (up + row[j - 1]);
} else {
row[j] = mismatch * diag + gap * (up + row[j - 1]);
}
diag = up;
}
}

To recap, the three semi-rings with which it makes sense to run this algorithm are:

1. $(\mathbb{R}, \oplus := +, \otimes := \times)$
2. $(\mathbb{R}_{-}, \oplus := \mbox{max}, \otimes := +)$
3. $(\mathcal{P}, \oplus := \mbox{convex hull of union}, \otimes := \mbox{Minkowski sum})$

The interpretation of the output of the algorithm with the three semi-rings is:

1. $S_{ij}$ is the score (probability) of a maximum alignment between the ith prefix of one sequence and the jth prefix of the other.
2. $S_{ij}$ is the total score (probability) of all alignments between the ith prefix of one sequence and the jth prefix of the other.
3. $S_{ij}$ is a polytope whose faces correspond to summaries that are optimal for some set of parameters.

The semi-rings in (2) and (3) lead to particularly interesting and useful applications of the Needleman-Wunsch algorithm. An example of the use of (2), is in posterior decoding based alignment where nucleotides are aligned as to maximize the sum of the posterior probabilities computed by (2). In the paper “Alignment Metric Accuracy” (with Ariel Schwartz and Gene Myers) we  provide an interpretation in terms of an interesting metric on sequence alignments and in

A.S. Schwartz and L. Pachter, Multiple alignment by sequence annealing, Bioinformatics 23 (2007), e24–e29

we show how to adapt posterior decoding into a (greedy) multiple alignment algorithm. The video below shows how it works, with each step consisting of aligning a single pair of nucleotides from some pair of sequences using posterior probabilities computed using the Needleman-Wunsch algorithm with semi-ring (2):

We later extended this idea in work together with my former student Robert Bradley in what became FSA published in the paper “Fast Statistical Alignment“, R. Bradley et al., PLoS Computational Biology 5 (2009), e1000392.

An interesting example of the application of Needleman-Wunsch with semi-ring (3), i.e. polytope propagation, is provided in the recent paper “The RNA Newton polytope and learnability of energy parameters” by Elmirasadat Forouzmand and Hamidreza Chitsaz, Bioinformatics 29 (2013), i300–i307. Using our polytope propagation algorithm, they investigate whether simple models for RNA folding can produce, for some set of parameters, the folds of sequences with solved structures (the answer is sometimes, but not always).

It is remarkable that forty three years after the publication of the Needleman-Wunsch paper, and thirty two years after the publication of the Smith-Waterman paper, the algorithms remain pertinent, widely used, and continue to reveal new secrets. True classics.

Where are the authors now?

Saul B. Needleman writes about numismatics.

Christian Wunsch is a clinical pathologist in the University of Miami health system.

Temple Smith is Professor Emeritus of Biomedical Engineering at Boston University.

Michael Waterman is Professor of Biological Sciences, Mathematics and Computer Science at the University of Southern California.