You are currently browsing the tag archive for the ‘alignment’ tag.

In 1963, a post-doctoral fellow at Caltech by the name of Emile Zuckerkandl, who was working with Linus Pauling, published a landmark paper that launched the field of comparative genomics. The full citation is:

L. Pauling and E. Zuckerkandl, “Chemical Paleogenetics: Molecular ‘restoration studies’ of extinct forms of life“, Acta Chem. Scand. 17 (1963) , S9–S16.

**The paper not only introduces the concept of comparative genomics, but also its generalization to phylogenomics. **It followed on the heels of pioneering work by Zuckerkandl, following the advice of Pauling, to sequence hemoglobin proteins in various primates and mammals. In related work, Zuckerkandl and Pauling (it is clear that Zuckerkandl was the creative and driving force) also introduced the concept of a molecular clock and proposed the concept of sequence alignment (although Ingram published first).

Zuckerkandl’s name and work are not as well known as they ought to be. Perhaps this is partly because his work was overshadowed by Watson and Crick’s famous paper published a decade earlier, but maybe also because computational methods take second place to experiments and Zuckerkandl’s legacy lies in his ideas about phylogenomics, not his contribution to protein sequencing.

The paper after which this blog post is titled is extraordinary in its originality, and the main figure in the paper is a classic that every computational biologist should instantly recognize:

The caption (from the paper), is as follows: Tentative partial structure of two chain-ancestors of the human hemoglobin polypeptide chains. The numbering of the residues is the one usually applied to the human -chain. The abbreviations for the amino acids are those commonly employed, except for asparagin (asg) and glutamine (glm). The other abbreviations: E=early epoch; M=medium epoch; L=late epoch; abs=absent. “None”, in column (e), means: probably no evolutionarily effective mutation occurred at the site under consideration in the line of descent leading from the ancestral genes to the human genes. The residues and comments are placed in parentheses when the conclusion reached is partly based on the consideration of human and sperm whale myglobins.

(a) Residue number

(b) Partial sequence of the -chain (late form)

(c) Partial sequence of the -chain (late form)

(d) Chain(s) in whose direct ancestry the mutation(s) seem to have occurred

(e) Nature of the substitution

(f) Quantitative evaluation of the time of mutation

(g) From evidence relating to a number of different animals (cf. Gratzer and Allison)

(h) Not after the time of the ancestor common to horse and man

(j) Amino-acid residue x present at a homologous site in the two myoglobins

(k) Polypeptide chain or

The inference of ancestry is based on the hemoglobin gene phylogeny

The rows (d)–(f) basically tell the story of the paper. Zuckerkandl and Pauling show that with sequence, alignment and phylogeny one can infer the history of mutation and substitution and its effects on extant sequence. Right there, we have the genesis of phylogenomics. In the discussion of the paper there is a detailed description of how phylogenomic techniques can be used to infer functional relationships among proteins, with the conclusion that “even apparently unrelated proteins can indeed have a common molecular ancestor”.

The fields of genomics and phylogenetics have a common ancestor as well but sadly this ancestor, Emile Zuckerkandl, passed away on November 9, 2013. May his ideas continue to evolve.

[**Update Dec. 12]: **Upon reading the initial version of this post, Ewan Birney asked me how the Zuckerkandl-Pauling alignment looks with modern tools, an interesting question now that 50 years have passed since their paper was published**. **I decided to look at this and started by downloaded the sequences and aligning them with three programs: 1) Clustal-Omega, the latest (last) version of the popular and widely used Clustal alignment programs, 2) Muscle, a sequence alignment tool billed by its author as “one of the best-performing multiple alignment programs” and 3) FSA, a sequence alignment program developed in my group by Robert Bradley. The results are interesting:

The Clustal-Omega alignment.

The muscle alignment.

It’s a bit complicated to compare these multiple alignments with the Z-P alignment. The latter is a pairwise alignment of the two ancestrally inferred sequences, whereas the multiple alignments above are of the extant sequences. This is already interesting; almost no alignment programs explicitly align by ancestral inference the way Z-P were doing it. The only exception, as far as I know, is MAVID, written by Nicolas Bray, although as implemented it was only suitable for DNA alignment. Nevertheless, the alignment is small enough that I was able to compare Z-P’s alignment to the Clustal-Omega, Muscle and FSA alignments by hand.

There are three areas of the alignment with indels. The first is right in the beginning, where Z-P annotate an indel right after the Valine (V). Both Clustal-Omega and Muscle moved this indel to the beginning of the sequence. That is probably because of heuristics setting gaps to the ends of sequences in case they are incomplete. In this case, I removed the Methionine because Z-P had not included it in their alignment. Nevertheless, FSA aligned this correctly (I think the Z-P alignment here is correct).

The second indel is two bases and different between all three programs and the Z-P alignment. Its a bit hard to say who is correct. One would have to look at more sequences, but I think that FSA looks good here (based on the posterior probabilities shown, see below).

The third indel involves the placement of the amino acids DLSH. In this case only muscle agrees with the Z-P alignment, however I think the FSA answer is the most informative (and only informative) answer in this case. FSA not only shows a single multiple alignment, but also colors it according to the posterior probabilities of the individual pairwise alignments. This is explained in the paper— the visualization is interactive and available via a java applet. What the FSA alignment shows is that there are really two solutions of almost equal accuracy: the one it produced and the Z-P solution. The latter requires two separate indel events. These are presumably much less common than mutations, so it’s not clear which is correct (this could be checked by examining a larger globin alignment including outlier species).

In summary, what the alignments with modern tools show, is that multiple alignment is really hard. Much harder than most people think!

A final analysis I did is to look at the tree inferred according to an evolutionary model (the FSA alignment server does this automatically). The inferred tree is unrooted and shown below:

It agrees with the Z-P tree, except for the fact that the beta-delta ancestor is much closer than what is shown in their figure. It should be noted that at the time of their paper, (Markov) models for amino acid changes along a tree had not yet been developed; that only started happening with Jukes and Cantor in 1967.

Overall, Zuckerkandl and Pauling did a remarkable job with their alignment, ancestral inference and tree estimation. What I learned from my analysis is that their paper still provides a case study for multiple alignment today.

I recently completed a term as director of our Center for Computational Biology and one of the things I am proud of having accomplished is helping Brian McClendon, program administrator for the Center, launch a new Ph.D. program in computational biology at UC Berkeley.

The Ph.D. program includes a requirement for taking a new course, “Classics in Computational Biology” (CMPBIO 201), that introduces students to research projects and approaches in computational biology via a survey of classic papers. I taught the class last week and the classic paper I chose was “Comparison of biosequences” by Temple Smith and Michael Waterman, Advances in Applied Mathematics **2 **(1981), p 482–489. I gave a preparatory lecture this past week in which I discussed the Needleman-Wunsch algorithm, followed by leading a class discussion about the Smith-Waterman paper and related topics. This post is an approximate transcription of my lecture on the Needleman-Wunsch algorithm:

The Needleman-Wunsch algorithm was published in “A general method applicable to the search for similarities in the amino acid sequence of two proteins“, Needleman and Wunsch, Journal of Molecular Biology **48** (1970), p 443–453. As with neighbor-joining or UniFrac that I just discussed in my previous blog post, the Needleman-Wunsch algorithm was published without an explanation of its meaning, or even a precise description of the dynamic programming procedure it is based on. In their paper 11 years later, Smith and Waterman write

“In 1970 Needleman and Wunsch [1] introduced their homology (similarity) algorithm. From a mathematical viewpoint, their work lacks rigor and clarity. But their algorithm has become widely used by the biological community for sequence comparisons.”

They go on to state that “The statement in Needleman-Wunsch [1] is less general, unclearly stated, and does not have a proof” (in comparison to their own main Theorem 1). The Theorem Smith and Waterman are referring to explicitly describes the dynamic programming recursion only implicit in Needleman-Wunsch, and then explains why it produces the “optimal” alignment. In hindsight, the Needleman-Wunsch algorithm can be described formally as follows:

Given two sequences *a *and *b *of lengths and respectively, an *alignment* * *is a sequence of pairs:

.

Given an alignment , we let

and

.

In other words, *M* counts the number of matching characters in the alignment, *X* counts the number of mismatches and *S* the number of “spaces”, i.e. characters not aligned to another character in the other sequence. Since every character must be either matched, mismatched, or a not aligned, we have that

.

Given scores (parameters) consisting of real numbers , the Needleman-Wunsch algorithm finds an alignment that maximizes the function . Note that by the linear relationship between *M,X* and *S *described above, there are really only *two free parameters,* and without loss of generality we can assume they are *s *and *x*.* *Furthermore, only *S *and *X *are relevant for computing the score of the alignment; when the scores are provided statistical meaning, one can say that *S *and* X* are sufficient statistics. We call them the *summary* of the alignment.

The specific procedure of the Needleman-Wunsch algorithm is to compute a matrix *S* recursively:

where is the indicator function that is 1 if the characters are the same, and zero otherwise, and an initialization step consists of setting and for all *i,j*. There are numerous generalizations and improvements to the basic algorithm, both in extending the scoring function to include more parameters (although most generalizations retain the linearity assumption), and algorithms for trading off time for space improvements (e.g. divide-and-conquer can be used to improve the space requirements from to ).

An important advance in understanding the Needleman-Wunsch algorithm was the realization that the “scores” *m,x* and *s* can be interpreted as logarithms of probabilities if the sign on the parameters is the same, or as log-odds ratios if the signs are mixed. This is discussed in detail in the book “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Sequences” by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison.

One insight from the statistical point of view is that if *max* and *plus* in the recursion are replaced with *plus *and *times*, and the logarithms of probabilities are replaced by the probabilities themselves, then the Needleman-Wunsch recursion computes the total probability of the pairs of sequences marginalized over alignments (i.e., it is the forward algorithm for a suitably defined hidden Markov model). Together with the backward algorithm, which consists of running the forward algorithm on the reversed sequences, this provides an efficient way to compute for every pair of nucleotides *the posterior probability that they are aligned. *In other words, there is a meaningful interpretation and useful application for the Needleman-Wunsch algorithm with the semi-ring replaced with .

In work in collaboration with Bernd Sturmfels, we realized that it also makes sense to run the Needleman-Wunsch algorithm with the polytope algebra . Here is the set of convex polytopes in for some *d, *polytopes are “added” using the geometric operation of the convex hull of the union of polytopes, and they are “multiplied” by taking Minkowski sum. The result allows one to find not only a single optimal alignment, but *all* optimal alignments. The details are discussed in a pair of papers:

L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16138–16143.

C. Dewey, P. Huggins, K, Woods, B. Sturmfels and L. Pachter, Parametric alignment of Drosophila genomes, PLoS Computational Biology, Volume 2, Number 6 (2006) p e73.

The first explains the mathematical foundations for using the result and is where we coined the term *polytope propagation algorithm* for the Needleman-Wunsch algorithm with the polytope algebra. The second paper illustrates an application to a whole-genome alignment of two Drosophilids and provides an alternative to polytope propagation (the beneath-beyond algorithm) that is much faster. The type of analysis performed in the paper is also called *parametric alignment*, a problem with a long history whose popularity was launched by Dan Gusfield who provided the first software program for parametric alignment called XPARAL.

The polytope propagation algorithm (Needleman-Wunsch with the polytope algebra) is illustrated in the figure below:

The convex polygon in the bottom right panel contains in its interior points corresponding to all possible summaries for the alignments of the two sequences. The optimal summaries are vertices of the polygon. For each vertex, the parameters for which the Needleman-Wunsch algorithm will produce a summary corresponding to that vertex form a cone. The cones together form a fan which is shown inside the polygon.

The running time of polytope propagation depends on the number of vertices in the final polytope. In our papers we provide bounds showing that polytope propagation is extremely efficient. In particular, for *d *parameters, the number of vertices is at most . Cynthia Vinzant, formerly in our math department at UC Berkeley, has written a nice paper on lower bounds.

The fact that the Needleman-Wunsch algorithm can be run with three semi-rings means that it is particularly well suited to C++ thanks to templates that allow for the variables in the recursion to be abstracted. This idea (and code) is thanks to my former student Colin Dewey:

`template<typename SemiRing>`

```
void
alignGlobalLastRow(const string& seq1,
const string& seq2,
const typename SemiRing::Element match,
const typename SemiRing::Element mismatch,
const typename SemiRing::Element gap,
vector& row);
const Element one = SemiRing::multiplicativeIdentity;
// Initialize row
row.resize(seq2.size() + 1);
row[0] = one;
// Calculate first row
for (size_t j = 1; j <= seq2.size(); ++j)
row[j] = gap * row[j - 1];
// Calculate remaining rows
Element up, diag;
for (size_t i = 1; i <= seq1.size(); ++i) {
diag = row[0];
row[0] *= gap;
for (size_t j = 1; j <= seq2.size(); ++j) {
up = row[j];
if (seq1[i - 1] == seq2[j - 1]) {
row[j] = match * diag + gap * (up + row[j - 1]);
} else {
row[j] = mismatch * diag + gap * (up + row[j - 1]);
}
diag = up;
}
}
```

To recap, the three semi-rings with which it makes sense to run this algorithm are:

The interpretation of the output of the algorithm with the three semi-rings is:

- is the score (probability) of a maximum alignment between the
*i*th prefix of one sequence and the*j*th prefix of the other. - is the total score (probability) of all alignments between the
*i*th prefix of one sequence and the*j*th prefix of the other. - is a polytope whose faces correspond to summaries that are optimal for some set of parameters.

The semi-rings in (2) and (3) lead to particularly interesting and useful applications of the Needleman-Wunsch algorithm. An example of the use of (2), is in *posterior* *decoding* based alignment where nucleotides are aligned as to maximize the sum of the posterior probabilities computed by (2). In the paper “Alignment Metric Accuracy” (with Ariel Schwartz and Gene Myers) we provide an interpretation in terms of an interesting metric on sequence alignments and in

A.S. Schwartz and L. Pachter, Multiple alignment by sequence annealing, Bioinformatics 23 (2007), e24–e29

we show how to adapt posterior decoding into a (greedy) multiple alignment algorithm. The video below shows how it works, with each step consisting of aligning a single pair of nucleotides from some pair of sequences using posterior probabilities computed using the Needleman-Wunsch algorithm with semi-ring (2):

We later extended this idea in work together with my former student Robert Bradley in what became FSA published in the paper “Fast Statistical Alignment“, R. Bradley et al., PLoS Computational Biology **5** (2009), e1000392.

An interesting example of the application of Needleman-Wunsch with semi-ring (3), i.e. polytope propagation, is provided in the recent paper “The RNA Newton polytope and learnability of energy parameters” by Elmirasadat Forouzmand and Hamidreza Chitsaz, Bioinformatics **29** (2013), i300–i307. Using our polytope propagation algorithm, they investigate whether simple models for RNA folding can produce, for some set of parameters, the folds of sequences with solved structures (the answer is sometimes, but not always).

It is remarkable that forty three years after the publication of the Needleman-Wunsch paper, and thirty two years after the publication of the Smith-Waterman paper, the algorithms remain pertinent, widely used, and continue to reveal new secrets. True classics.

Where are the authors now?

Saul B. Needleman writes about numismatics.

Christian Wunsch is a clinical pathologist in the University of Miami health system.

Temple Smith is Professor Emeritus of Biomedical Engineering at Boston University.

Michael Waterman is Professor of Biological Sciences, Mathematics and Computer Science at the University of Southern California.

## Recent Comments