In 1963, a post-doctoral fellow at Caltech by the name of Emile Zuckerkandl, who was working with Linus Pauling, published a seminal paper that launched the field of comparative genomics. The full citation is:

L. Pauling and E. Zuckerkandl, “Chemical Paleogenetics: Molecular ‘restoration studies’ of extinct forms of life”, Acta Chem. Scand. 17 (1963), S9–S16.

The paper not only introduces the concept of comparative genomics but also its generalization to phylogenomics. It followed on the heels of pioneering work by Zuckerkandl, undertaken on the advice of Pauling, to sequence hemoglobin proteins in various primates and mammals. In related work, Zuckerkandl and Pauling (it is clear that Zuckerkandl was the creative and driving force) also introduced the concept of a molecular clock and proposed the concept of sequence alignment (although Ingram published first).

Zuckerkandl’s name and work are not as well known as they ought to be. Perhaps this is partly because his work was overshadowed by Watson and Crick’s famous paper published a decade earlier, but maybe also because computational methods take second place to experiments and Zuckerkandl’s legacy lies in his ideas about phylogenomics, not his contribution to protein sequencing.

The paper after which this blog post is titled is extraordinary in its originality, and the main figure in the paper is a classic that every computational biologist should instantly recognize:

The caption (from the paper) is as follows: Tentative partial structure of two chain-ancestors of the human hemoglobin polypeptide chains. The numbering of the residues is the one usually applied to the human $\alpha$-chain. The abbreviations for the amino acids are those commonly employed, except for asparagin (asg) and glutamine (glm). The other abbreviations: E=early epoch; M=medium epoch; L=late epoch; abs=absent. “None”, in column (e), means: probably no evolutionarily effective mutation occurred at the site under consideration in the line of descent leading from the ancestral genes to the human genes. The residues and comments are placed in parentheses when the conclusion reached is partly based on the consideration of human and sperm whale myoglobins.

(a) Residue number

(b) Partial sequence of the $II^{\beta}--III^{\gamma}--IV^{\delta}$-chain (late form)

(c) Partial sequence of the $I^{\alpha}--II^{\beta}--III^{\gamma}--IV^{\delta}$-chain (late form)

(d) Chain(s) in whose direct ancestry the mutation(s) seem to have occurred

(e) Nature of the substitution

(f) Quantitative evaluation of the time of mutation

(g) From evidence relating to a number of different animals (cf. Gratzer and Allison)

(h) Not after the time of the ancestor common to horse and man

(j) Amino-acid residue x present at a homologous site in the two myoglobins

(k) Polypeptide chain $I^{\alpha}$ or $II^{\beta}--III^{\gamma}--IV^{\delta}$

The inference of ancestry is based on the hemoglobin gene phylogeny.

The rows (d)–(f) basically tell the story of the paper. Zuckerkandl and Pauling show that with sequence, alignment and phylogeny one can infer the history of mutation and substitution and its effects on extant sequence. Right there, we have the genesis of phylogenomics. In the discussion of the paper there is a detailed description of how phylogenomic techniques can be used to infer functional relationships among proteins, with the conclusion that “even apparently unrelated proteins can indeed have a common molecular ancestor”.
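To make the idea of inferring ancestral residues from sequence, alignment and phylogeny concrete, here is a minimal sketch using Fitch's small-parsimony algorithm on a single alignment column. This is a later, standard technique swapped in for illustration; it is not the procedure Zuckerkandl and Pauling followed. The tree topology (alpha splitting off first, then gamma, then beta and delta) is an assumption consistent with the chain groupings above, and the residues in the example column are made up.

```python
def fitch_column(tree, residues):
    """Fitch small-parsimony pass for one alignment column.

    tree     -- nested 2-tuples whose leaves are chain names
    residues -- dict mapping each chain name to its residue in this column
    Returns (candidate ancestral residues at the root, minimum substitutions).
    """
    substitutions = 0

    def visit(node):
        nonlocal substitutions
        if isinstance(node, str):          # leaf: the observed residue
            return {residues[node]}
        left = visit(node[0])
        right = visit(node[1])
        shared = left & right
        if shared:                         # children agree on some residue
            return shared
        substitutions += 1                 # disagreement costs one substitution
        return left | right

    return visit(tree), substitutions


# Assumed topology: alpha splits first, then gamma, then beta and delta.
globin_tree = ("alpha", ("gamma", ("beta", "delta")))
# Made-up residues for a single column, for illustration only.
column = {"alpha": "A", "gamma": "G", "beta": "G", "delta": "S"}

root_candidates, cost = fitch_column(globin_tree, column)
print(sorted(root_candidates), cost)   # ['A', 'G'] 2
```

In this toy column the beta-gamma-delta ancestor is resolved to G, the root remains ambiguous between A and G, and at least two substitutions are needed, which is the kind of per-site bookkeeping that rows (d) to (f) record.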

The fields of genomics and phylogenetics have a common ancestor as well but sadly this ancestor, Emile Zuckerkandl, passed away on November 9, 2013. May his ideas continue to evolve.

[Update Dec. 12]: Upon reading the initial version of this post, Ewan Birney asked me how the Zuckerkandl-Pauling alignment looks with modern tools, an interesting question now that 50 years have passed since their paper was published. I decided to look at this and started by downloading the sequences and aligning them with three programs: 1) Clustal-Omega, the latest (last) version of the popular and widely used Clustal alignment programs, 2) Muscle, a sequence alignment tool billed by its author as “one of the best-performing multiple alignment programs” and 3) FSA, a sequence alignment program developed in my group by Robert Bradley. The results are interesting:

The Clustal-Omega alignment.

The Muscle alignment.

The FSA alignment.

It's a bit complicated to compare these multiple alignments with the Z-P alignment. The latter is a pairwise alignment of the two ancestrally inferred sequences, whereas the multiple alignments above are of the extant sequences. This is already interesting; almost no alignment programs explicitly align by ancestral inference the way Z-P were doing it. The only exception, as far as I know, is MAVID, written by Nicolas Bray, although as implemented it was only suitable for DNA alignment. Nevertheless, the alignment is small enough that I was able to compare Z-P’s alignment to the Clustal-Omega, Muscle and FSA alignments by hand.
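A by-hand comparison of this kind can be approximated in code by projecting two rows out of a multiple alignment and asking which residue pairs end up in the same column. The sketch below is an illustration under stated assumptions, not the procedure used for this post: the function names are hypothetical, gaps are assumed to be written as `-`, and the short sequences are toy strings rather than the alignments in the figures.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Set of (i, j) residue-index pairs that share a column in two gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have equal length")
    pairs = set()
    i = j = 0                      # residue counters for each row
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))      # residue i of A is aligned to residue j of B
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def agreement(reference, candidate):
    """Fraction of the reference's aligned pairs that the candidate reproduces."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(ref_pairs & aligned_pairs(*candidate)) / len(ref_pairs)


# Toy example: the same two sequences with one gap placed in two different ways.
reference = ("V-LSP", "VHLTP")     # gap after the first residue
candidate = ("-VLSP", "VHLTP")     # gap pushed to the start
print(agreement(reference, candidate))   # 0.75
```

Because only residue indices are compared, the score does not depend on how many all-gap columns the projection leaves behind, which makes it usable on rows pulled from alignments of different widths.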

There are three areas of the alignment with indels. The first is right at the beginning, where Z-P annotate an indel right after the Valine (V). Both Clustal-Omega and Muscle moved this indel to the beginning of the sequence. That is probably because of heuristics that push gaps to the ends of sequences in case they are incomplete. In this case, I removed the Methionine because Z-P had not included it in their alignment. Nevertheless, FSA aligned this correctly (I think the Z-P alignment here is correct).

The second indel is two residues long and is placed differently by each of the three programs and by the Z-P alignment. It's a bit hard to say who is correct. One would have to look at more sequences, but I think that FSA looks good here (based on the posterior probabilities shown, see below).

The third indel involves the placement of the amino acids DLSH. In this case only Muscle agrees with the Z-P alignment; however, I think the FSA answer is the most informative (and only informative) answer in this case. FSA not only shows a single multiple alignment, but also colors it according to the posterior probabilities of the individual pairwise alignments. This is explained in the paper; the visualization is interactive and available via a Java applet. What the FSA alignment shows is that there are really two solutions of almost equal accuracy: the one it produced and the Z-P solution. The latter requires two separate indel events. These are presumably much less common than mutations, so it's not clear which is correct (this could be checked by examining a larger globin alignment including outlier species).
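The notion of "separate indel events" can be made precise with a small counting sketch. The assumption here is that an event is a maximal run of consecutive gap columns in one row of a pairwise alignment; that is a common convention, not something defined in the post or by FSA. The function name and the toy sequences are hypothetical.

```python
def count_indel_events(row_a, row_b, gap="-"):
    """Count maximal gap runs in a pairwise alignment, one event per run."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have equal length")
    events = 0
    previous = None                      # gap state of the previous column
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue                     # all-gap column left over from an MSA projection
        if a == gap:
            state = "gap_in_a"
        elif b == gap:
            state = "gap_in_b"
        else:
            state = None                 # both rows have a residue here
        if state is not None and state != previous:
            events += 1                  # a new gap run starts in this column
        previous = state
    return events


# Toy alignments: one two-residue gap versus two one-residue gaps.
print(count_indel_events("AC--GT", "ACTTGT"))   # 1
print(count_indel_events("A-C-GT", "ATCTGT"))   # 2
```

Both toy alignments contain the same number of gap characters, but the second needs two events, which is why a placement like it is penalized more heavily under the usual assumption that indels are rarer than substitutions.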

In summary, what the alignments with modern tools show is that multiple alignment is really hard. Much harder than most people think!

The final analysis I did was to look at the tree inferred according to an evolutionary model (the FSA alignment server does this automatically). The inferred tree is unrooted and shown below:

It agrees with the Z-P tree, except for the fact that the beta-delta ancestor is much closer than what is shown in their figure. It should be noted that at the time of their paper, (Markov) models for amino acid changes along a tree had not yet been developed; that only started happening with Jukes and Cantor in 1969.
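For readers unfamiliar with what such a Markov model buys you, here is a minimal sketch of the simplest case: an equal-rates (Jukes-Cantor style) model with 20 amino acid states, which converts the observed fraction of differing sites into an estimated number of substitutions per site. This is an illustration only; it is not the model used by the FSA server, and the function name and toy sequences are hypothetical.

```python
import math


def equal_rates_distance(seq_a, seq_b, n_states=20, gap="-"):
    """Substitutions per site under an equal-rates model with n_states residues."""
    # Only columns where both sequences have a residue are informative.
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not sites:
        raise ValueError("no ungapped columns to compare")
    p = sum(a != b for a, b in sites) / len(sites)   # observed fraction of differences
    k = (n_states - 1) / n_states                    # saturation level of p
    if p >= k:
        return math.inf                              # too diverged to correct
    return -k * math.log(1 - p / k)


# Toy pair differing at 2 of 20 aligned sites (p = 0.1).
a = "A" * 18 + "CC"
b = "A" * 20
print(round(equal_rates_distance(a, b), 4))   # 0.1057
```

The corrected distance is slightly larger than the raw 10% difference because the model allows for multiple substitutions at the same site, and it is distances of this kind that tree-building under an evolutionary model relies on.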

Overall, Zuckerkandl and Pauling did a remarkable job with their alignment, ancestral inference and tree estimation. What I learned from my analysis is that their paper still provides a case study for multiple alignment today.