
Continuous-time Markov chain models for DNA mutations on a phylogenetic tree (e.g. the Jukes-Cantor model, the Kimura models, and more generally models of the Felsenstein hierarchy) have the simple and convenient property of multiplicativity. Specifically, if Q is a rate matrix then the associated substitution matrices are multiplicative in the following sense:

$e^{Q(t_1+t_2)} = e^{Qt_1}e^{Qt_2}$.

This follows directly from the fact that the matrices $Qt_1$ and $Qt_2$ commute, because for any two commuting matrices A and B

$e^{A+B} = e^{A}e^{B}$.

This means that substitutions over a time period 2t are equivalently described as substitutions occurring over a time period t, followed by substitutions occurring afterwards over another time period t.
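As a quick numerical sanity check of this property, here is a minimal Python sketch for the Jukes-Cantor model. It assumes the standard closed form of the Jukes-Cantor substitution matrix for a rate matrix whose off-diagonal entries all equal alpha; the function names and the values of alpha, t1 and t2 are illustrative choices, not part of any library.

```python
import math

def jc_substitution_matrix(alpha, t):
    """Closed-form e^{Qt} for the Jukes-Cantor rate matrix Q, which has
    off-diagonal entries alpha and diagonal entries -3*alpha."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay   # probability of ending at the starting nucleotide
    diff = 0.25 - 0.25 * decay   # probability of ending at one specific other nucleotide
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def mat_mul(A, B):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def max_abs_diff(A, B):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

alpha, t1, t2 = 0.3, 0.4, 1.1   # arbitrary illustrative values
lhs = jc_substitution_matrix(alpha, t1 + t2)
rhs = mat_mul(jc_substitution_matrix(alpha, t1),
              jc_substitution_matrix(alpha, t2))
print(max_abs_diff(lhs, rhs))   # difference is at the level of floating-point rounding
```

The two sides agree up to rounding error, which is the multiplicativity property for a single fixed rate matrix.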

But what if over the course of time the rate matrix changes? For example, suppose that for a period of time t mutations proceed according to a rate matrix Q, and following that, for another period of time t,  mutations proceed according to a rate matrix R? Is it true that the substitutions after time 2t will behave as if mutations occurred for a time 2t according to the (average) rate matrix $\frac{Q+R}{2}$?

If Q and R commute the answer will be yes, as Qt and Rt will also commute and the multiplicativity property will hold. But what if Q and R don’t commute? Is there any relationship at all between $e^{\frac{Q+R}{2}2t}$ and the matrices $e^{Qt}$ and $e^{Rt}$?

This week I visited Yale University to give a talk in the Center for Biomedical Data Science seminar series.  I was invited by Smita Krishnaswamy, who organized a wonderful visit that included many interesting conversations not only in computational biology, but also applied math, computer science and statistics (Yale has strong programs in applied mathematics, statistics and data science, computer science and biostatistics). At dinner I learned from Dan Spielman of the Golden-Thompson inequality which provides a beautiful answer to the question above in the case where Q and R are symmetric. The theorem is a trace inequality for Hermitian matrices A and B:

$tr(e^{A+B}) \leq tr(e^Ae^B)$.

This inequality is well known in statistical mechanics and random matrix theory but I don’t believe it is known in the phylogenetics community, hence this post. The phylogenetic interpretation of the pieces of the Golden-Thompson inequality (replacing A with Qt and B with Rt) is straightforward:

• The matrices $e^{Qt}$ and $e^{Rt}$ are substitution matrices for the rate matrices Q and R respectively.
• The product $e^{Qt}e^{Rt}$ is the substitution matrix corresponding to mutations occurring with rate matrix Q for time t followed by rate matrix R for time t.
• The matrix $e^{Qt+Rt} = e^{\frac{Q+R}{2} \cdot 2t}$ is the substitution matrix for mutations occurring with rate $\frac{Q+R}{2}$ for time 2t.
• Since the trace of a substitution matrix is, up to normalization by the number of nucleotides, the probability that no change in nucleotide is observed (assuming a uniformly chosen starting nucleotide), the Golden-Thompson inequality states that for two symmetric rate matrices Q and R, the probability of observing no substitution after time 2t is at least as high when mutations occur first at rate Q for time t and then at rate R for time t as when they occur at rate $\frac{Q+R}{2}$ for time 2t.

In other words, rate changes decrease the expected number of observed substitutions in comparison to what one would see if rates were constant.
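The inequality can also be checked numerically. The sketch below is only an illustration under stated assumptions: the two symmetric rate matrices Q and R and the time t are made-up values, chosen so that Q and R do not commute, and the matrix exponential is a simple scaling-and-squaring routine with a truncated Taylor series written for this example rather than a library function.

```python
def mat_mul(A, B):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(A, squarings=12, terms=18):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    n = len(A)
    scaled = [[a / 2.0 ** squarings for a in row] for row in A]
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    result, term = identity, identity
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + x for r, x in zip(rrow, trow)]
                  for rrow, trow in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)   # undo the scaling
    return result

def symmetric_rate_matrix(ac, ag, at, cg, ct, gt):
    """Symmetric rate matrix over (A, C, G, T); each row sums to zero."""
    off = [[0, ac, ag, at], [ac, 0, cg, ct], [ag, cg, 0, gt], [at, ct, gt, 0]]
    return [[-sum(off[i]) if i == j else off[i][j] for j in range(4)]
            for i in range(4)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# Hypothetical rate matrices: Q is Kimura-like, R breaks that symmetry.
Q = symmetric_rate_matrix(1, 2, 1, 1, 2, 1)
R = symmetric_rate_matrix(3, 0.5, 0.5, 0.5, 0.5, 0.5)
t = 0.5

QR, RQ = mat_mul(Q, R), mat_mul(R, Q)
commute = all(abs(QR[i][j] - RQ[i][j]) < 1e-12
              for i in range(4) for j in range(4))

Qt = [[q * t for q in row] for row in Q]
Rt = [[r * t for r in row] for row in R]
sum_t = [[q + r for q, r in zip(qrow, rrow)] for qrow, rrow in zip(Qt, Rt)]

tr_constant = trace(mat_exp(sum_t))                      # tr(e^{Qt+Rt})
tr_changing = trace(mat_mul(mat_exp(Qt), mat_exp(Rt)))   # tr(e^{Qt} e^{Rt})
print(commute, tr_constant, tr_changing)
```

Dividing either trace by 4 gives the probability of observing no change when the starting nucleotide is chosen uniformly at random.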

The Golden-Thompson inequality was discovered independently by Sidney Golden and Colin Thompson in 1965. A proof is explained in an expository blog post by Terence Tao, who heard of the Golden-Thompson inequality only eight years ago, which makes me feel a little bit better about not having heard of it until this week! It would be nice if there were a really simple proof, but that appears not to be the case (there is a purported one-page proof in a paper titled Golden-Thompson from Davis; however, what is proved there is the different inequality $tr(e^{A+B}) \leq tr(e^A)tr(e^B)$, which can be shown, by virtue of another matrix trace inequality, to be a weaker inequality).

There is considerable interest in evolutionary biology in models that allow for time-varying rates of mutation, as there is substantial evidence of such variation. The Golden-Thompson inequality provides an additional insight for how mutation rate changes over time can affect naïve estimates based on homogeneity assumptions.

The Felsenstein hierarchy (from Algebraic Statistics for Computational Biology).

“An entertaining freshness… Tic Tac!” This is Ferrero’s tag line for its most successful product, the ubiquitous Tic Tac. And the line has stuck. As WikiHow points out in how to make your breath fresh: first buy some mints, then brush your teeth.

One of the amazing things about Tic Tacs is that they are sugar free. Well… almost not. As the label explains, a single serving (one single Tic Tac) contains 0g of sugar (to be precise, less than 0.5g, as explained in a footnote). In what could initially be assumed to be a mere coincidence, the weight of a single serving is 0.49g. It did not escape my attention that 0.50-0.49=0.01. Why?

To understand it helps to look at the labeling rules of the FDA. I’ve reproduced the relevant section (Title 21) below, and boldfaced the relevant parts:

 TITLE 21–FOOD AND DRUGS
 CHAPTER I–FOOD AND DRUG ADMINISTRATION DEPARTMENT OF HEALTH AND HUMAN SERVICES
 SUBCHAPTER B–FOOD FOR HUMAN CONSUMPTION

(c) Sugar content claims –(1) Use of terms such as “sugar free,” “free of sugar,” “no sugar,” “zero sugar,” “without sugar,” “sugarless,” “trivial source of sugar,” “negligible source of sugar,” or “dietarily insignificant source of sugar.” Consumers may reasonably be expected to regard terms that represent that the food contains no sugars or sweeteners e.g., “sugar free,” or “no sugar,” as indicating a product which is low in calories or significantly reduced in calories. Consequently, except as provided in paragraph (c)(2) of this section, a food may not be labeled with such terms unless:

(i) The food contains less than 0.5 g of sugars, as defined in 101.9(c)(6)(ii), per reference amount customarily consumed and per labeled serving or, in the case of a meal product or main dish product, less than 0.5 g of sugars per labeled serving; and

(ii) The food contains no ingredient that is a sugar or that is generally understood by consumers to contain sugars unless the listing of the ingredient in the ingredient statement is followed by an asterisk that refers to the statement below the list of ingredients, which states “adds a trivial amount of sugar,” “adds a negligible amount of sugar,” or “adds a dietarily insignificant amount of sugar;” and

(iii)(A) It is labeled “low calorie” or “reduced calorie” or bears a relative claim of special dietary usefulness labeled in compliance with paragraphs (b)(2), (b)(3), (b)(4), or (b)(5) of this section, or, if a dietary supplement, it meets the definition in paragraph (b)(2) of this section for “low calorie” but is prohibited by 101.13(b)(5) and 101.60(a)(4) from bearing the claim; or

(B) Such term is immediately accompanied, each time it is used, by either the statement “not a reduced calorie food,” “not a low calorie food,” or “not for weight control.”

It turns out that Tic Tacs are in fact almost pure sugar. It’s easy to figure this out by comparing the number of calories per serving (1.9) with the number of calories per gram of sugar (3.8) multiplied by the serving weight (0.49g), which gives 1.862 calories. That is 98% sugar! Ferrero basically admits this in their FAQ. Acting completely within the bounds of the law, they have simply exploited an arbitrary threshold of the FDA. Arbitrary thresholds are always problematic; not only can they have unintended consequences, but they can be manipulated to engineer desired outcomes. In computational biology they have become ubiquitous, frequently being described as “filters” or “pre-processing steps”. Regardless of how they are justified, thresholds are thresholds are thresholds. They can sometimes be beneficial, but they are dangerous when wielded indiscriminately.
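For the record, the arithmetic is spelled out in the short sketch below. The three input numbers are the ones quoted in the text; the variable names are just labels chosen for this example.

```python
serving_weight_g = 0.49          # weight of one Tic Tac, from the label
calories_per_serving = 1.9       # from the label
calories_per_gram_sugar = 3.8    # figure used in the text
fda_sugar_free_threshold_g = 0.5 # "less than 0.5 g of sugars" per serving

# Calories a serving would contain if it were nothing but sugar.
calories_if_pure_sugar = calories_per_gram_sugar * serving_weight_g
sugar_fraction = calories_if_pure_sugar / calories_per_serving

# Even a serving made entirely of sugar stays under the threshold.
under_threshold = serving_weight_g < fda_sugar_free_threshold_g

print(round(calories_if_pure_sugar, 3), round(sugar_fraction, 2), under_threshold)
# prints: 1.862 0.98 True
```

The last line is the whole trick: a serving that weighs less than the threshold cannot exceed it, whatever it is made of.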

There is one type of thresholding/filtering used in RNA-Seq that my postdoc Bo Li and I have been thinking about a bit this year. It consists of removing duplicate reads, i.e. reads that map to the same position in a transcriptome. The motivation behind such filtering is to reduce or eliminate amplification bias, and it is based on the intuition that it is unlikely that lightning strikes the same spot multiple times. That is, it is improbable that many reads would map to the exact same location assuming a model for sequencing that posits selecting fragments from transcripts uniformly. The idea is also called de-duplication or digital normalization.

Digital normalization is obviously problematic for high abundance transcripts. Consider, for example, a transcript that is so abundant that it is extremely likely that at least one read will start at every site (ignoring the ends, which for the purposes of the thought experiment are not relevant). This would also be the case if the transcript were twice as abundant, and so digital normalization would eliminate the possibility of estimating the difference. This issue was noted in a paper published earlier this year by Zhou et al. The authors investigate in some detail the implications of this problem, and quantify the bias it introduces in a number of data sets. But a key question not answered in the paper is what does digital normalization actually do?

To answer the question, it is helpful to consider how one might estimate the abundance of a transcript after digital normalization. One naive approach is to just count the number of reads after de-duplication, followed by normalization for the length of the transcript and the number of reads sequenced. Specifically, if there are n sites where a read might start, and k of the sites had at least one read, then the naive approach would be to use the estimate $\frac{k}{n}$, suitably normalized for the total number of reads in the experiment. This is exactly what is done in standard de-duplication pipelines, or in digital normalization as described in the preprint by Brown et al. However, assuming a simple model for sequencing, namely that every read is selected by first choosing a transcript according to a multinomial distribution and then choosing a location on it uniformly at random from among the n sites, a different formula emerges.

Let X be a random variable that denotes the number of reads starting at a given site of a transcript with n sites in a random sequencing experiment, where the number of reads starting at each site of the transcript is Poisson distributed with parameter c (i.e., the average coverage of the transcript is c). Then the probability that a site is covered by at least one read is

$Pr(X \geq 1) = 1-Pr(X=0) = 1-e^{-c}$.

The maximum likelihood estimate for c can also be obtained by the method of moments, which is to set the observed fraction of covered sites equal to this probability:

$\frac{k}{n} = 1-e^{-c}$

from which it is easy to see that

$c = -log(1-\frac{k}{n})$.
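A small simulation shows the estimator at work. This is a sketch under the same Poisson model as in the derivation; the function names, the seed, and the values of n and c are hypothetical choices made for the example.

```python
import math
import random

def corrected_coverage(k, n):
    """Method-of-moments estimate c = -log(1 - k/n) from k covered sites out of n."""
    if k >= n:
        return math.inf   # every site is covered: the estimate is unbounded
    return -math.log(1.0 - k / n)

def simulate_covered_sites(c, n, rng):
    """Number of sites with at least one read when per-site counts are Poisson(c).
    Under that model a site is covered with probability 1 - e^{-c}."""
    p_covered = 1.0 - math.exp(-c)
    return sum(rng.random() < p_covered for _ in range(n))

rng = random.Random(0)      # fixed seed so that the run is reproducible
n, true_c = 100_000, 1.0    # hypothetical number of sites and average coverage
k = simulate_covered_sites(true_c, n, rng)

# k/n is the de-duplicated fraction; the corrected value should be close to true_c.
print(k / n, corrected_coverage(k, n))
```

Note that the estimate is infinite when k = n, which is the situation of the very abundant transcript in the thought experiment.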

The derivation above is the same as that of the Jukes-Cantor correction in phylogenetics, where the method of moments equation is replaced by $\frac{4}{3}\frac{k}{n} = 1-e^{-\frac{4}{3}c}$, yielding $D_{JC} = -\frac{3}{4}log(1-\frac{4}{3}\frac{k}{n})$, but I’ll leave an extended discussion of the Jukes-Cantor model and correction for a future post.
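For comparison, the Jukes-Cantor correction can be written as a two-line function. In this sketch p plays the role of $\frac{k}{n}$, here the observed fraction of sites that differ between two sequences; the function names and the value of d are illustrative assumptions.

```python
import math

def jukes_cantor_distance(p):
    """D_JC = -(3/4) log(1 - (4/3) p) for an observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 3/4")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

def expected_fraction_different(d):
    """Inverse map, from the method of moments equation: p = (3/4)(1 - e^{-(4/3) d})."""
    return 0.75 * (1.0 - math.exp(-(4.0 / 3.0) * d))

d = 0.5                                  # hypothetical distance (substitutions per site)
p = expected_fraction_different(d)       # fraction of differing sites one expects to see
print(p, jukes_cantor_distance(p))       # the correction maps p back to d
```

Just as with coverage, the observed fraction saturates (here at 3/4) while the corrected distance grows without bound.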

The point here, as noticed by Bo Li, is that since $log(1-x) \approx -x$ for small x by Taylor approximation, it follows that at low coverage the average coverage can be estimated by $c \approx \frac{k}{n}$. This is exactly the naive estimate of de-duplication or digital normalization. However, the fact that $\frac{k}{n} \rightarrow 1$ as $k \rightarrow n$ means that $-log(1-\frac{k}{n})$ blows up at high coverage, where the naive estimate stays below 1, hence the results of Zhou et al.
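To see how quickly the first-order approximation degrades, the following sketch tabulates the naive and corrected estimates at the expected covered fraction $1-e^{-c}$ for a few coverage values. The list of coverages is an arbitrary illustrative choice.

```python
import math

def expected_covered_fraction(c):
    """Expected value of k/n under the Poisson model: 1 - e^{-c}."""
    return 1.0 - math.exp(-c)

rows = []
for c in [0.01, 0.1, 0.5, 1.0, 3.0, 6.0]:
    naive = expected_covered_fraction(c)   # k/n, the de-duplicated estimate
    corrected = -math.log(1.0 - naive)     # -log(1 - k/n), which recovers c
    rows.append((c, naive, corrected))
    print(f"c={c:<5} naive={naive:.4f} corrected={corrected:.4f}")
```

At the low end the two estimates nearly coincide, while at the high end doubling the coverage from 3 to 6 barely moves the naive estimate.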

Digital normalization as proposed by Brown et al. involves possibly thresholding at more than one read per site (for example choosing a threshold C and removing all but at most C reads at every site). But even this modified heuristic fails to adequately relate to a probabilistic model of sequencing. One interesting and easy exercise is to consider the second or higher order Taylor approximations. But a more interesting approach to dealing with amplification bias is to avoid thresholding per se, and to instead identify outliers among duplicate reads and to adjust them according to an estimated distribution of coverage. This is the approach of Hashimoto et al. in the paper “Universal count correction for high-throughput sequencing” published in March in PLoS One. There are undoubtedly other approaches as well, and in my opinion the issue will receive renewed attention in the coming year as the removal of amplification biases in single-cell transcriptome experiments becomes a priority.

As mentioned above, digital normalization/de-duplication is just one of many thresholds applied in a typical RNA-Seq “pipeline”. To get a sense of the extent of thresholding, one need only scan the (supplementary?) methods section of any genomics paper. For example, the GEUVADIS RNA-Seq consortium describe their analysis pipeline as follows:

“We employed the JIP pipeline (T.G. & M.S., data not shown) to map mRNA-seq reads and to quantify mRNA transcripts. For alignment to the human reference genome sequence (GRCh37, autosomes + X + Y + M), we used the GEM mapping suite24 (v1.349 which corresponds to publicly available pre-release 2) to first map (max. mismatches = 4%, max. edit distance = 20%, min. decoded strata = 2 and strata after best = 1) and subsequently to split-map (max.mismatches = 4%, Gencode v12 and de novo junctions) all reads that did not map entirely. Both mapping steps are repeated for reads trimmed 20 nucleotides from their 3′-end, and then for reads trimmed 5 nucleotides from their 5′-end in addition to earlier 3′-trimming—each time considering exclusively reads that have not been mapped in earlier iterations. Finally, all read mappings were assessed with respect to the mate pair information: valid mapping pairs are formed up to a maximum insert size of 100,000 bp, extension trigger = 0.999 and minimum decoded strata = 1. The mapping pipeline and settings are described below and can also be found in https://github.com/gemtools, where the code as well as an example pipeline are hosted.”

This is not a bad pipeline (the paper shows it was carefully evaluated), and it may have been a practical approach to dealing with the large amount of RNA-Seq data in the project. But even the first and seemingly innocuous thresholding to trim low quality bases from the ends of reads is controversial and potentially problematic. In an analysis published earlier this year, Matthew MacManes looked carefully at the effect of trimming in RNA-Seq, and concluded that aggressive trimming of bases below Q20, a standard that is frequently employed in pipelines, is problematic. I think his Figure 3, which I’ve reproduced below, is very convincing:

It certainly appears that some mild trimming can be beneficial, but a threshold that is optimal (and more importantly not detrimental) depends on the specifics of the dataset and is difficult or impossible to determine a priori. MacManes’ view (for more see his blog post on the topic) is consistent with another paper by Del Fabbro et al. that, while seemingly positive about trimming in the abstract, actually concludes that “In the specific case of RNA-Seq, the tradeoff between sensitivity (number of aligned reads) and specificity (number of correctly aligned reads) seems to be always detrimental when trimming the datasets (Figure S2); in such a case, the modern aligners, like Tophat, seem to be able to overcome low quality issues, therefore making trimming unnecessary.”

Alas, Tic Tac thresholds are everywhere. My advice is: brush your teeth first.