
The paper “Genomic-scale capture and sequencing of endogenous DNA from feces” by George H. Perry, John C. Marioni, Páll Melsted and Yoav Gilad is literally full of feces. The word ‘fecal’ appears 100 times.

Poop jokes aside, the paper presents an interesting idea that has legs. Perry et al. show that clever use of Agilent’s SureSelect allows for capturing nuclear genomic regions from fecal DNA. Intellectually, it is the predecessor of T. Mercer et al.’s “Targeted RNA sequencing reveals deep complexity of the human transcriptome”, Nature Biotechnology, 2011 (another paper I like and for which I wrote a research highlight). Even though the Perry et al. paper does not have many citations, the Mercer et al. paper does (although unfortunately the authors forgot to cite Perry et al., which I think they should have). In other words, the Perry et al. paper is not as well known as it ought to be, and this post is an attempt to rectify that.

The ‘*’ in sh*t in my title is for *Seq. At a high level, the Perry et al. paper shows how high-throughput sequencing technology can be leveraged to sequence deeply a single genome from among a community of metagenomes. For this reason, and for convenience, I will henceforth refer to the Perry et al. paper as the Sh*t-Seq paper. The “*” is inserted in lieu of the “i” not as censorship, but to highlight the point that the method is general and applies not only to sequencing the nuclear genome from fecal DNA, but also, as Mercer et al. show, to targeted transcriptome sequencing (one can imagine many other applications as well).

The Sh*t-Seq protocol is conceptually simple yet complicated in practice. DNA was captured using the Agilent SureSelect target enrichment system coupled to the Illumina single-end sequencing platform library prep protocol (of note is that the paper is from 2010 and the kits are from 2009). However, because of the very small amount of DNA targeted (the authors claim ~1.8%), a number of adjustments to the standard SureSelect capture / Illumina library prep had to be implemented. To the authors’ credit, the paragraphs in the section “DNA Capture” are exemplary in their level of detail and presumably greatly facilitate replicability. I won’t repeat all the details here. However, there are two steps that caught my attention as possibly problematic. First, the authors performed substantial PCR amplification of the adapter-ligated fecal DNA. This affects the computational analysis they discuss later and leads to a computational step they implemented that I have some issues with (more on this later). Second, they performed two rounds of capture, as one round was insufficient for capturing the needed material for sequencing. This necessitated additional PCR, which is also possibly problematic.

The samples collected were from six chimpanzees. This is a fairly small n but the paper is a proof of principle and I think this is sufficient. Both fecal and blood samples were collected allowing for comparison of the fecally derived nuclear DNA to the actual genome of the primates. In what is clearly an attempt to channel James Bond, they collected fecal samples (2 g of stool) within 1 hour of defecation in tubes containing RNALater and these were then “shaken vigorously” (not stirred).

The next part of the paper is devoted to computational analyses to confirm that the Sh*t-Seq protocol can in fact be used to target nuclear endogenous DNA. As a sanity check, mtDNA was considered first. They noted too much diversity to align reads to a reference genome with BWA, and opted instead for de novo assembly using ABySS. This is certainly overkill, possible only because of the high copy number of mitochondrial DNA. But I suppose it worked (after filtering out all the low-coverage ABySS sequence, which was presumably junk). One interesting idea, given more modern RNA-Seq assembly tools, would be to assemble the resulting reads with an RNA-Seq de novo assembler that allows for different abundances of sequences. Ideally, such an assembly would naturally reveal the sought-after enrichment.

Next, nuclear DNA was investigated, specifically the X chromosome and chromosome 21. Here the analysis is very pre-2014. First, all multi-mapping reads were removed. This is not a good idea for many reasons, and I am quite certain that with the new GRCh38 (with alternate sequence representation for variant regions) it is a practice that will rapidly be phased out. I’d like to give Perry et al. the benefit of the doubt for making this mistake since they published in 2010, but their paper appeared 6 months after the Cufflinks paper so they could have, in principle, known better. Having said that, while I do think multi-mapping would have allowed them to obtain much stronger results as to the accuracy and extent of their enrichment by avoiding the tossing of a large number of reads, their paper does manage to prove their principle, so it’s not a big deal.

The removal of multi-mapping reads was just the first step in a series of “filters” designed to narrow down the nuclear DNA reads to regions of the chimpanzee genome that could be argued to be unambiguously representative of the target. I won’t go into details, although they are all in the paper. As with the experimental methods, I applaud the authors on publishing reproducible methods, especially computational methods, with all details included. But there was a final red flag for me in the computational methods: namely the selection of a single unique fragment (at random) for each genomic (start) site for the purposes of calling SNPs. This was done to eliminate problems due to amplification biases, which is indeed a serious concern, but if heterozygous sites appear due to the PCR steps, then there ought to have been telltale signatures. For example, a PCR “SNP” would, I think, appear only in reads specific to a single position, but not in other overlapping reads of the site. It would have been very helpful if they had done a detailed analysis of this issue, rather than just picking a single read at random for each genomic (start) site. They kicked the can down the road.
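To make this concrete, here is a minimal sketch (my own illustration, not the authors’ pipeline) of the kind of check I have in mind: for a candidate heterozygous site, count how many distinct fragment start positions support the alternate allele. A PCR artifact amplified from a single molecule should be supported by reads from only one start position, whereas a genuine heterozygous site should show up in overlapping reads with several different starts. The read representation and threshold below are hypothetical; in real data one would of course work from the BAM file, but the logic is the same.

```python
def supporting_starts(reads, site, alt_base):
    """Count distinct fragment start positions whose reads carry alt_base at `site`.

    `reads` is a list of (start, sequence) tuples with 0-based genomic starts;
    `site` is a 0-based genomic coordinate. This is a toy representation, not
    an actual BAM parser.
    """
    starts = set()
    for start, seq in reads:
        offset = site - start
        if 0 <= offset < len(seq) and seq[offset] == alt_base:
            starts.add(start)
    return len(starts)

def looks_like_pcr_artifact(reads, site, alt_base, min_starts=2):
    """Flag a candidate het site if the alternate allele is confined to reads
    sharing a single start position (consistent with a PCR/duplication artifact)."""
    return supporting_starts(reads, site, alt_base) < min_starts

# Toy example: three duplicate reads from one start carry a 'T' at position 102,
# but reads starting elsewhere do not, so the site would be flagged.
reads = [(100, "ACTGA"), (100, "ACTGA"), (100, "ACTGA"), (98, "GGACAGA"), (101, "CAGAT")]
print(looks_like_pcr_artifact(reads, 102, "T"))  # True
```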

Having removed multi-mapping reads, repetitive regions, low-coverage regions, etc. etc. Dayenu, they ended up estimating a false positive rate for heterozygous sites (using the X chromosome in males) at 0.0007% for fecal DNA and 0.0010% for blood DNA. This led them to conclude that the proportions of incorrectly identified heterozygous sites in their study were 0.8%, 2.0%, 1.1%, and 2.7% for fecal DNA chromosome 21, fecal DNA chromosome X in females, blood DNA chromosome 21, and blood DNA chromosome X in females, respectively. Such good news is certainly the result of their extraordinarily stringent filtering, but I think it does prove that they were able to target effectively. They give further proof using PCR and Sanger sequencing of 20 regions.

I have a final nitpick and it relates to Figure 4. It is a companion to Figure 3, which shows the chimpanzee phylogeny for their samples based on the mtDNA. As expected in that figure, the fecal and blood samples cluster together. Figure 4 shows two phylogenies, one based on chr 21, the other on chr X. My issue here is with the way that distances were constructed. It’s a technical point, but it looks like they used Hamming distance, and I don’t think that makes a lot of sense, not to mention the fact that neighbor-joining does not seem like the appropriate algorithm for building a tree in this setting (I plan to blog about neighbor-joining shortly). But this is a methodological point not really relevant to the main result of the paper, namely proof of principle for targeted sequencing of endogenous DNA from fecal matter.

I think Sh*t-Seq has a future. The idea of targeted capture coupled to high-throughput sequencing has more than an economic rationale. It provides the possibility to probe the “deep field” as discussed in the previously mentioned review on targeted RNA-Seq. This is a general principle that should be more widely recognized.

And of course, dung is just cool. Happy new year!

[Image: Bart Simpson chalkboard generator] My new year’s resolution.

In a letter to a German princess on “the twelve tones of the harpsichord” penned by Leonhard Euler on the 3rd of May 1760 (coincidentally exactly 213 years before my birthday (coincidentally 213 is part of a consecutive triple of integers, each a product of a pair of distinct prime numbers; 213=3×71, 214=2×107, 215=5×43)), the tension between equal temperament and harmony is highlighted as a central issue in music. Euler wrote: “It is evident, therefore, that in fact all semitones are not equal, whatever efforts may be made by musicians to render them such; because true harmony resists the execution of a design contrary to its nature”.

What was Euler talking about?

Music is relative. With the exception of a handful of individuals who possess perfect pitch (as an aside, the genetics are interesting), we perceive music relatively. This is very similar to an RNA-Seq experiment. Transcript abundances are measured only relative to each other (by comparing read counts), although of course a transcript with very high relative abundance can also be inferred to be absolutely abundant, as one has prior knowledge about the abundances of housekeeping and other yardstick genes. So it is with music. We know when we are listening to high vs. low pitch, but within a register we hear combinations of notes as concordant or discordant, as melodic or not, based on the relative frequencies of the pitches.

Harmony is the use of simultaneous sounds in music, and is closely associated with the notion of consonance. Specifically, the perfect fifth is the consonant sound of two frequencies at a ratio of 3/2 (the question of why this ratio sounds “good” is an interesting one, and I recommend Helmholtz’s “On the sensations of tone” as an excellent starting point for exploration). Harmony leads to a “rational” requirement of musical scales: for example, the ability to produce a perfect fifth requires that among notes corresponding to sounds at pre-specified frequencies, there is a pair at a frequency ratio of 3/2. Other consonant harmonies correspond to other rational numbers with small denominators (e.g. 4/3). Equal temperament is a relative requirement: in a musical scale composed of N notes, it is desirable that any pair of consecutive notes are at the same frequency ratio, say r. This allows a piece of music to begin on any note.

The tension Euler described between harmony and equal temperament is the mathematical observation that if r^N = 2 and r^{k}=\frac{3}{2}, then k should be chosen so that \frac{k}{N} = log_2 \frac{3}{2}. This is because the first equation leads to Nlog_2 r = log_2 2 = 1 so that log_2 r = \frac{1}{N}, and the second equation then leads to k log_2 r = k \cdot \left( \frac{1}{N} \right) = log_2 \frac{3}{2}. However log_2 \frac{3}{2} is irrational. This is because if it were rational then log_2 3 would be rational, and that would mean that log_2 3 = \frac{a}{b} for some positive integers a,b, which in turn would mean that 2^a = 3^b, which is impossible because powers of two are always even, and powers of three always odd. Ergo a problem.

The mathematics above shows that harmony is fundamentally incompatible with equal temperament, but so what? If a very good rational approximation can be found for log_2 \frac{3}{2} then music might not be perfect, but perfect should not be the enemy of the good. For example, in a 12 note scale, we find that the note seven semitone steps above the first (e.g. G if we start at C) produces a frequency ratio with the first of 2^{\frac{7}{12}} = 1.4983..., which is awfully close to 1.5. Does the difference in the third decimal really matter when one is listening to Rammstein?

Is 12 special? What about an 11 note scale? The following math provides the answer:

A (simple) continued fraction representation for an irrational number is an infinite expression of the form

x=a_0+\frac{1}{a_1+\frac{1}{a_2+\frac{1}{a_3+\cdots}}}

The convergents of x are the rational numbers obtained by truncating the continued fraction expression at successively increasing finite points, i.e. the sequence \frac{a_0}{1}, \frac{a_0a_1+1}{a_1},\frac{a_2(a_1a_0+1)+a_0}{a_2a_1+1},\ldots

Theorem: Let r_i=\frac{n_i}{d_i} be the convergents of a number x. Then the rational number r_i is the best rational approximation to x with denominator \leq d_i.

The theorem is saying that if one approximates an irrational number with its continued fractions, then the denominators of those fractions are optimal in the sense that they provide rational approximations that improve on all lower denominators (an elementary proof is written up here). For example, the continued fraction approximation of log_2 \frac{3}{2} is given by the sequence [0,1,1,2,2,3,1,5,2,23,2,2,1,\ldots ] which translates to the expansion

log_2 \frac{3}{2} = 0+\frac{1}{1+\frac{1}{1+\frac{1}{2+\frac{1}{2+ \cdots}}}}

and leads to the rational approximations 0,1,\frac{1}{2},\frac{3}{5},\frac{7}{12},\frac{24}{41},\frac{31}{53},\ldots

The fourth denominator is 5, and its interpretation in light of the theorem is that no rational number with 1, 2, 3 or 4 in the denominator can provide an approximation as good as 3/5 to log_2 \frac{3}{2}. Similarly, with 12 being the next denominator, the theorem implies that no rational number with a denominator between 6 and 11 can provide an approximation to log_2 \frac{3}{2} as good as 7/12. In fact, to improve on the accuracy of the perfect fifth a 29 note scale is required, a number which is much less practical than 12. There are of course other “rational” considerations in choosing a scale. Not only must the perfect fifth be approximated well, but also thirds and fourths. It turns out 12 provides a reasonably good approximation for all of these. In other words, the 12 note scale is not just a historical accident, it is a mathematical inevitability.
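For the skeptical reader, the computation is easy to reproduce. Below is a short Python sketch (standard library only; floating point limits how many partial quotients can be trusted) that computes the continued fraction expansion of log_2 \frac{3}{2}, its convergents, and the equal-tempered “fifth” 2^{k/N} for a few scale sizes N.

```python
import math
from fractions import Fraction

x = math.log2(1.5)  # log_2(3/2) ~ 0.58496

def continued_fraction(x, n_terms=8):
    """Return the first n_terms partial quotients of x (limited by float precision)."""
    terms = []
    for _ in range(n_terms):
        a = math.floor(x)
        terms.append(a)
        frac = x - a
        if frac == 0:
            break
        x = 1 / frac
    return terms

def convergents(terms):
    """Yield the convergents h_k/d_k computed with the standard recurrence."""
    h_prev, h = 1, terms[0]
    d_prev, d = 0, 1
    yield Fraction(h, d)
    for a in terms[1:]:
        h, h_prev = a * h + h_prev, h
        d, d_prev = a * d + d_prev, d
        yield Fraction(h, d)

terms = continued_fraction(x)             # [0, 1, 1, 2, 2, 3, 1, 5]
for c in convergents(terms):              # 0, 1, 1/2, 3/5, 7/12, 24/41, 31/53, ...
    print(c, float(abs(c - x)))

# The equal-tempered "fifth" in an N-note scale is 2**(k/N) for the best k.
for N in (11, 12, 29, 41):
    k = round(N * x)
    print(N, k, 2 ** (k / N))             # N=12 gives 2**(7/12) ~ 1.4983
```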

The question is did Princess Friederike Charlotte of Brandenburg-Schwedt realize that underlying the principle Euler was writing to her about was the key to understanding not only the theory of music, but also a fundamental design in nature?

The number of petals in flowers

An obvious question not addressed by the theorem stated above is how good are the rational approximations of irrational numbers? Obviously a rational number can always be chosen to be arbitrarily close to any irrational number (e.g. by truncating the decimal expansion arbitrarily far along). But the question is interesting when considered, in light of the theorem above, in terms of the accuracy of approximation as a function of its denominator. Hurwitz’s theorem provides the answer:

Theorem [Hurwitz, 1891]: For every irrational number x, there are infinitely many rational approximations n/d such that

\left| x - \frac{n}{d} \right| < \frac{1}{\sqrt{5}d^2}.

Moreover, the constant \frac{1}{\sqrt{5}} is best possible.

In other words, the approximation can be as good as the inverse of the square of the denominator (this should be contrasted with using the decimal expansion to approximate, which provides accuracy only proportional to the inverse of the denominator). The fact that \frac{1}{\sqrt{5}} is the best possible constant means that there is some irrational number such that if one sets the constant to be smaller, then there are only finitely many rational approximations achieving the improved accuracy. One cannot get more accurate than \frac{1}{\sqrt{5}d^2} infinitely often, and it turns out that the irrational number that is the extremal example is the golden ratio \varphi = \frac{1+\sqrt{5}}{2}. In a sense made precise by the theorem above, the golden ratio is the hardest irrational number to approximate by rationals. 
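A quick numerical illustration of this extremality (my own sketch): the convergents of \varphi are ratios of consecutive Fibonacci numbers, and for them the scaled error d^2 \left| \varphi - \frac{n}{d} \right| settles down towards \frac{1}{\sqrt{5}} \approx 0.447, so the bound in Hurwitz’s theorem is essentially attained.

```python
import math

phi = (1 + math.sqrt(5)) / 2

# Convergents of phi are ratios of consecutive Fibonacci numbers: 2/1, 3/2, 5/3, ...
a, b = 1, 1
for _ in range(15):
    a, b = b, a + b                      # b/a is the next convergent
    scaled_error = a * a * abs(phi - b / a)
    print(f"{b}/{a}  d^2*|phi - n/d| = {scaled_error:.6f}")

print("1/sqrt(5) =", 1 / math.sqrt(5))   # ~0.447214; the scaled errors hover around this
```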

Who would care about irrational numbers that are hard to approximate by rational numbers? Mothers of newborn twins?! However as whimsical as Ehud Friedgut’s Murphy’s law sounds, evolution appears to have appropriated the principle of offsetting cycles to optimize branching patterns in plants and trees.

Tree branches around a trunk, or flower petals around a stem, demand sunlight and ideally do not line up with each other. Suppose that they are equally spaced, separated by a fixed angle \theta. If \theta is a rational fraction of a full turn, then the branches or petals will line up perfectly, creating a pattern, viewed from above, of lines emanating from the trunk or main stem. An irrational fraction ensures that branches never line up perfectly; however, if the irrational number is well approximated by a rational number then the overlap will be extensive. That is, the branches will almost line up. The solution? Choose an irrational angle that is as hard as possible to approximate by rationals. The golden ratio! The corresponding angle, 360/\varphi^2 \approx 137.5 degrees, is known as the golden angle.
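Here is a small simulation of this argument (an illustration of the principle, not a model of any real plant): place successive branches at a fixed fraction of a full turn and record the largest angular gap left open. A rational step such as 3/5 of a turn collapses onto five rays and the gap stalls at 72 degrees, while the golden-angle step keeps closing the largest gap as more branches appear.

```python
import math

def largest_gap(step_fraction, n_branches):
    """Place n_branches at multiples of step_fraction (of a full turn) and
    return the largest angular gap, in degrees, between adjacent branches."""
    angles = sorted((i * step_fraction) % 1.0 for i in range(n_branches))
    gaps = [angles[i + 1] - angles[i] for i in range(len(angles) - 1)]
    gaps.append(1.0 - angles[-1] + angles[0])    # wrap-around gap
    return 360 * max(gaps)

phi = (1 + math.sqrt(5)) / 2
golden_step = 1 - 1 / phi                        # ~0.382 of a turn, i.e. ~137.5 degrees

for n in (10, 50, 200):
    print(n,
          round(largest_gap(3 / 5, n), 1),       # rational step: stuck at 72 degrees (5 rays)
          round(largest_gap(golden_step, n), 1)) # golden angle: the gap keeps shrinking
```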

With the golden ratio, the branching pattern when viewed from above looks like a series of spirals. Depending on the time between the appearance of branches (or seeds at the center of a flower), the number of spirals varies. The numbers though will always correspond to one of the convergents of the (continued fraction expansion of the) golden ratio. These are

\frac{2}{3}, \frac{3}{5}, \frac{5}{8}, \frac{8}{13}, \frac{13}{21}, \ldots.

Note that the numerators and denominators are consecutive Fibonacci numbers. Since flower petals typically form at the tips of the spirals, one frequently finds a Fibonacci number of petals on flowers. Fibonacci numbers are all over plants. Nature is irrational.


Georgia O’Keeffe: Music Pink and Blue No. 2 (1918). Next time you are pondering what her flowers truly represent, think also of the tension between the rational and the irrational in music, and in nature.

On Saturday I submitted the final grades for Math 10A, the new UC Berkeley freshman math class for intended biology majors that I taught this semester. In assigning students their grades, I had a chance to reflect again on the system we use and its substantial shortcomings.

The system is broken, and my grade assignment procedure illustrates why. Math 10A had 223 students this semester, and they were graded according to the following policy: homework 10%, quizzes 10%, midterms 20% each (there were two midterms), and the final 40%. If midterm 1 was skipped then midterm 2 counted 40%. Similarly, if midterm 2 was skipped then the final counted 60%. This produced a raw score for each student and the final distribution is shown below (zeroes not shown):

[Figure: distribution of raw scores in Math 10A, Fall 2013.]

The distribution seems fairly “reasonable”. One student didn’t do any work or show up and got a 5/100. At the other end of the spectrum some students aced the class. The average score was 74.48 and the standard deviation 15.06. An optimal raw score distribution should allow for detailed discrimination between students (e.g. if everyone gets the same score that’s not helpful). I think my distribution could have been a bit better but overall I am satisfied with it. The problem comes with the next step: after obtaining raw scores in a class, the professor has to set cutoffs for A+/A/A-/B+/B/B-/C+/C/C-/D+/D/D-/F. Depending on how the cutoffs are set, the grade distribution can change dramatically. In fact, it is easy to see that any discrete distribution on letter grades is achievable from any raw score distribution. One approach to letter grades would be to fix an A at, say, any raw score greater than or equal to 90%, i.e., no curving. I found that threshold on Wikipedia. But that is rarely how grades are set, partly because of large variability in the difficulty of exams. Almost every professor I know “curves” to some extent. At Berkeley one can examine grade distributions here.
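To make the claim that any letter-grade distribution is achievable concrete, here is a small sketch with simulated scores (the numbers are made up to roughly match the summary statistics above; this is not my actual Math 10A data): cutoffs are simply placed at the quantiles of the raw scores corresponding to a chosen target distribution.

```python
import random

random.seed(0)
# Hypothetical raw scores, roughly matching the summary statistics in the post.
scores = [min(100, max(0, random.gauss(74.5, 15.0))) for _ in range(223)]

# Any target letter distribution can be hit by placing cutoffs at the
# corresponding quantiles of the raw scores ("curving").
target = {"A": 0.25, "B": 0.35, "C": 0.25, "D": 0.05, "F": 0.10}

def cutoffs_from_target(scores, target):
    """Return letter -> minimum raw score, chosen so grade proportions match target."""
    ordered = sorted(scores, reverse=True)
    cutoffs, assigned = {}, 0
    for letter, frac in target.items():          # dicts preserve insertion order
        assigned += round(frac * len(ordered))
        assigned = min(assigned, len(ordered))
        cutoffs[letter] = ordered[assigned - 1]  # lowest score still earning this letter
    return cutoffs

print(cutoffs_from_target(scores, target))
# Changing `target` changes the cutoffs without touching the raw scores,
# which is exactly why a letter grade by itself carries so little information.
```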

It turns out that Roger Purves from statistics used to aim for a uniform distribution:


Roger Purves’ Stat 2 grade distribution over the past 6 years.

The increase in C- grades is explained by an artifact of the grading system at Berkeley.  If a student fails the class they can take it again and record the passing grade for their GPA (although the F remains on the transcript). A grade of D is not only devastating for the GPA, but also permanent. It cannot be improved by retaking the class. Therefore many students try to fail when they are doing poorly in a class, and many professors simply avoid assigning Ds. In other words, Purves’ C- includes his Ds. Another issue is that an A+ vs. A does not affect GPA, but an A vs. A- does; the latter is obviously a very subjective difference that varies widely between classes and professors. Note that Roger Purves just didn’t assign A+ grades, presumably because they have no GPA significance (although they do arguably have a psychological impact).

Marina Ratner from math failed more students [Update November 9, 2014: Prof. Ratner has pointed out to me that she receives excellent reviews from students on Ratemyprofessors, while explaining that “the large number of F in my classes are due to the large number of students who drop the class but are still on the list or don’t do any work” and that “One of the reasons why my students learned and others did not was precisely because of my grading policy.”]. Her grade distribution for Math 1b in the Spring semester of 2009 is below:


Marina Ratner’s Math 1B, Spring 2009.

In the same semester, in a parallel section, her colleague Richard Borcherds gave the following grades:


Richard Borcherds’ Math 1B, Spring 2009.

Unlike Ratner, Borcherds appears to be averse to failing students. Only 7 students failed out of 441 who were enrolled in his two sections that semester. Fair?

And then there are those who believe in the exponential distribution, for example Heino Nitsche who teaches Chem 1A:


Heino Nitsche’s Chem 1A, Spring 2011.

The variability in grade assignment is astonishing. As can be seen above, curving is prevalent and arbitrary, and the idea that grades have an absolute meaning is not credible. It is statistically highly unlikely that Ratner’s students were always terrible at learning math (whereas Borcherds “luckily” got the good students). Is chemistry inherently easy, to the point where an average student taking the class deserves an A?

This messed up system differs in its details at other schools, but it is similarly broken everywhere. Sadly, many schools have used letter grading to manipulate GPAs via continual grade inflation. Just three weeks ago on December 3rd, the dean of undergraduate education at Harvard confirmed that the median grade at Harvard is an A- and the most common grade an A. The reasons for grade inflation are manifold. But I can understand it on a personal level. It is tempting for a faculty member to assign As because those are likely to immediately translate to better course evaluations (both internal, and public on sites such as Ninja Courses and ratemyprofessor). Local grade inflation can quickly lead to global inflation as professors, and at a higher level their universities, compete with each other for the happiest students.

How did I assign letter grades for Math 10A?

After grading the final exams together, my five GSIs started the process of setting letter grade thresholds by examining the grades of “yardstick students”. These were students for which the GSIs felt confident in declaring their absolute knowledge of the material to be at the A, B, C or F levels. We then refined the thresholds, adding +/- cutoffs, trying simultaneously to respect the yardsticks while keeping the overall grade distribution in mind. Finally, I asked the GSIs to consider moving students upward across thresholds if they had shown consistent improvement and commitment throughout the semester (a policy I had informed the students of in class). The result was that about 40% of the students ended with an A. Students failed the class at a threshold where we believed they had not learned enough of the material to proceed to Math 10B. I have to say that as far as my job goes, assigning letter grades for courses is the least scientific endeavor I participate in.

What should be done?

Until recently grades were recorded on paper, making it difficult to perform anything but trivial computations on the raw scores or letter grades. But electronic recording of grades allows for more sophisticated analysis. This should be taken advantage of. Suppose that instead of a letter grade, each student’s raw scores were recorded, along with the distribution of class scores. A single letter would immediately be replaced by a meaningful number in context.

I do think it is unfair to grade students only relatively, especially with respect to cohorts that can range in quality. But it should be possible to compute a meaningful custom raw score distribution specific to individual students based on the classes they have taken. The raw data is a 3-way table whose dimensions consist of professors x classes x raw scores. This table is sparse, as professors typically only teach a handful of different courses throughout their career. But by properly averaging the needed distributions as gleaned from this table, it should be possible to produce for each student an overall GPA score, together with the variance of the (student-specific) distribution it came from, averaged over the courses the student took. The resulting distribution and score could be renormalized to produce a single meaningful number. That way, taking “difficult” classes with professors who grade harshly would not penalize the GPA. Similarly, aiming for easy As wouldn’t help the GPA. And manipulative grade inflation on the part of professors and institutions would be much more difficult.
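As a sketch of one possible such normalization (toy data, and just one of many reasonable choices): convert each raw score into an empirical percentile within the class where it was earned, then average the percentiles across a student’s courses. Grading harshness and easy-A classes then largely wash out of the final number.

```python
from bisect import bisect_right

# Toy slice of the professors x classes x raw scores table (made-up numbers).
class_scores = {
    ("Prof. X", "Math 10A"): [45, 55, 60, 62, 70, 74, 80, 88, 95],
    ("Prof. Y", "Chem 1A"):  [80, 85, 88, 90, 92, 95, 97, 98, 99],
}

def percentile(score, distribution):
    """Fraction of scores in the class at or below `score` (a simple empirical percentile)."""
    ordered = sorted(distribution)
    return bisect_right(ordered, score) / len(ordered)

def normalized_gpa(transcript):
    """Average percentile over a student's courses; transcript maps (prof, class) -> raw score."""
    pct = [percentile(s, class_scores[c]) for c, s in transcript.items()]
    return sum(pct) / len(pct)

# A 74 in the harshly graded class counts for more than a 90 in the easy one.
student = {("Prof. X", "Math 10A"): 74, ("Prof. Y", "Chem 1A"): 90}
print(normalized_gpa(student))
```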

It’s time to level the playing field for students, eliminate the possibility of manipulative grade inflation, and stop the hypocrisy. We need not only to preach statistical and computational thinking to our students, but also to practice it in every aspect of their education.

In 1963, a post-doctoral fellow at Caltech by the name of Emile Zuckerkandl, who was working with Linus Pauling, published a landmark paper that launched the field of comparative genomics. The full citation is:

L. Pauling and E. Zuckerkandl, “Chemical Paleogenetics: Molecular ‘restoration studies’ of extinct forms of life”, Acta Chem. Scand. 17 (1963), S9–S16.

The paper not only introduces the concept of comparative genomics, but also its generalization to phylogenomics. It followed on the heels of pioneering work by Zuckerkandl, undertaken at Pauling’s suggestion, to sequence hemoglobin proteins in various primates and mammals. In related work, Zuckerkandl and Pauling (it is clear that Zuckerkandl was the creative and driving force) also introduced the concept of a molecular clock and proposed the concept of sequence alignment (although Ingram published first).

Zuckerkandl’s name and work are not as well known as they ought to be. Perhaps this is partly because his work was overshadowed by Watson and Crick’s famous paper published a decade earlier, but maybe also because computational methods take second place to experiments and Zuckerkandl’s legacy lies in his ideas about phylogenomics, not his contribution to protein sequencing.

The paper after which this blog post is titled is extraordinary in its originality, and the main figure in the paper is a classic that every computational biologist should instantly recognize:

[Figure: the main figure from Zuckerkandl and Pauling (1963).]

The caption (from the paper) is as follows: Tentative partial structure of two chain-ancestors of the human hemoglobin polypeptide chains. The numbering of the residues is the one usually applied to the human \alpha-chain. The abbreviations for the amino acids are those commonly employed, except for asparagine (asg) and glutamine (glm). The other abbreviations: E=early epoch; M=medium epoch; L=late epoch; abs=absent. “None”, in column (e), means: probably no evolutionarily effective mutation occurred at the site under consideration in the line of descent leading from the ancestral genes to the human genes. The residues and comments are placed in parentheses when the conclusion reached is partly based on the consideration of human and sperm whale myoglobins.

(a) Residue number

(b) Partial sequence of the II^{\beta}--III^{\gamma}--IV^{\delta}-chain (late form)

(c) Partial sequence of the I^{\alpha}--II^{\beta}--III^{\gamma}--IV^{\delta}-chain (late form)

(d) Chain(s) in whose direct ancestry the mutation(s) seem to have occurred

(e) Nature of the substitution

(f) Quantitative evaluation of the time of mutation

(g) From evidence relating to a number of different animals (cf. Gratzer and Allison)

(h) Not after the time of the ancestor common to horse and man

(j) Amino-acid residue x present at a homologous site in the two myoglobins

(k) Polypeptide chain I^{\alpha} or II^{\beta}--III^{\gamma}--IV^{\delta}

The inference of ancestry is based on the hemoglobin gene phylogeny:

[Figure: the hemoglobin gene phylogeny from Zuckerkandl and Pauling (1963).]

The rows (d)–(f) basically tell the story of the paper. Zuckerkandl and Pauling show that with sequence, alignment and phylogeny one can infer the history of mutation and substitution and its effects on extant sequence. Right there, we have the genesis of phylogenomics. In the discussion of the paper there is a detailed description of how phylogenomic techniques can be used to infer functional relationships among proteins, with the conclusion that “even apparently unrelated proteins can indeed have a common molecular ancestor”.

The fields of genomics and phylogenetics have a common ancestor as well but sadly this ancestor, Emile Zuckerkandl, passed away on November 9, 2013. May his ideas continue to evolve.

[Update Dec. 12]: Upon reading the initial version of this post, Ewan Birney asked me how the Zuckerkandl-Pauling alignment looks with modern tools, an interesting question now that 50 years have passed since their paper was published. I decided to look at this and started by downloading the sequences and aligning them with three programs: 1) Clustal-Omega, the latest (last) version of the popular and widely used Clustal alignment programs, 2) Muscle, a sequence alignment tool billed by its author as “one of the best-performing multiple alignment programs” and 3) FSA, a sequence alignment program developed in my group by Robert Bradley. The results are interesting:


The Clustal-Omega alignment.


The muscle alignment.

The FSA alignment.

It’s a bit complicated to compare these multiple alignments with the Z-P alignment. The latter is a pairwise alignment of the two ancestrally inferred sequences, whereas the multiple alignments above are of the extant sequences. This is already interesting; almost no alignment programs explicitly align by ancestral inference the way Z-P were doing it. The only exception, as far as I know, is MAVID, written by Nicolas Bray, although as implemented it was only suitable for DNA alignment. Nevertheless, the alignment is small enough that I was able to compare Z-P’s alignment to the Clustal-Omega, Muscle and FSA alignments by hand.
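For anyone who wants to repeat the comparison, here is a small sketch using Biopython to list the gap-containing columns of each multiple alignment, which is where the interesting differences between the programs lie. The file names and the clustal format are assumptions on my part; adjust them to whatever your aligners actually produced.

```python
from Bio import AlignIO

def gap_columns(path, fmt="clustal"):
    """Return the 0-based column indices of an alignment that contain at least one gap."""
    aln = AlignIO.read(path, fmt)
    return [i for i in range(aln.get_alignment_length()) if "-" in aln[:, i]]

# Hypothetical file names for the three alignments discussed in the post.
for path in ("globins_clustalo.aln", "globins_muscle.aln", "globins_fsa.aln"):
    print(path, gap_columns(path))
```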

There are three areas of the alignment with indels. The first is right in the beginning, where Z-P annotate an indel right after the Valine (V). Both Clustal-Omega and Muscle moved this indel to the beginning of the sequence. That is probably because of heuristics that push gaps to the ends of sequences in case they are incomplete. In this case, I removed the methionine because Z-P had not included it in their alignment. Nevertheless, FSA aligned this correctly (I think the Z-P alignment here is correct).

The second indel is two residues long and differs between all three programs and the Z-P alignment. It’s a bit hard to say who is correct. One would have to look at more sequences, but I think that FSA looks good here (based on the posterior probabilities shown, see below).

The third indel involves the placement of the amino acids DLSH. In this case only Muscle agrees with the Z-P alignment; however, I think the FSA answer is the most informative (indeed the only informative) one. FSA not only shows a single multiple alignment, but also colors it according to the posterior probabilities of the individual pairwise alignments. This is explained in the paper; the visualization is interactive and available via a Java applet. What the FSA alignment shows is that there are really two solutions of almost equal accuracy: the one it produced and the Z-P solution. The latter requires two separate indel events. These are presumably much less common than mutations, so it’s not clear which is correct (this could be checked by examining a larger globin alignment including outlier species).

In summary, what the alignments with modern tools show, is that multiple alignment is really hard. Much harder than most people think!

A final analysis I did was to look at the tree inferred according to an evolutionary model (the FSA alignment server does this automatically). The inferred tree is unrooted and shown below:

[Figure: the unrooted tree inferred from the FSA alignment.]

It agrees with the Z-P tree, except for the fact that the beta-delta ancestor is much closer than what is shown in their figure. It should be noted that at the time of their paper, (Markov) models for amino acid changes along a tree had not yet been developed; that only started happening with Jukes and Cantor in 1969.

Overall, Zuckerkandl and Pauling did a remarkable job with their alignment, ancestral inference and tree estimation. What I learned from my analysis is that their paper still provides a case study for multiple alignment today.
