You are currently browsing the monthly archive for September 2013.

The International Society for Clinical Densitometry has an official position on the number of decimal digits that should be reported when describing the results of bone density scans:

• BMD (bone mineral
density): three digits (e.g., 0.927 g/cm2). •
• T-score: one digit (e.g., –2.3).
•  Z-score: one digit (e.g., 1.7).
•  BMC (bone mineral content): two digits (e.g., 31.76
g).
•  Area: two digits (e.g., 43.25
cm2).
•  % reference database: Integer
(e.g., 82%).
Are these recommendations reasonable? Maybe not. For example they fly in the face of the recommendation in the “seminal” work of Ehrenberg (Journal of the Royal Statistical Society A, 1977)  which is to use two decimal digits.

Two? Three? What should it be? This what my Math10 students always ask of me.

I answered this question for my freshmen in Math10 two weeks ago by using an example based on a dataset from the paper Schwartz, M. “A biomathematical approach to clinical tumor growth“. Cancer 14 (1961): 1272-1294. The paper has a dataset consisting of the size of a pulmonary neoplasm over time:

A simple model for the tumor growth is $f(t) = a \cdot b^t$ and in class I showed how a surprisingly good fit can be obtained by interpolating through only two points ($t=0$ and $t=208$):

$f(0)= 1.8 \Rightarrow a \cdot b^0 = 1.8 \Rightarrow a = 1.8$.

Then we have that $f(208) = 3.5 \Rightarrow 1.8 \cdot b^{208} = 3.5 \Rightarrow b= \sqrt{208}{3.5/1.8} \approx 1.00302$.

The power function $f(t)=1.8 \cdot t^{1.00302}$ is shown in the figure. The fit is surprisingly good considering it is based on an interpolation using only two points. The point of the example is that if one rounds the answer 1.00302 to two decimal digits then one obtains  $f(t) = 1.8t^1 = 1.8t$ which is linear as opposed to super-linear. In other words, a small (quantitative) change in the assumptions (restricting the rate to intervals differing by 0.01) results in a major qualitative change in results:  with two decimal digits the patient lives, with three… death!

This simple example of decimal digit arithmetic illustrates a pitfall affecting many computational biology studies. It is tempting to believe that $\mbox{Qualitative} \subset \mbox{Quantitative}$, i.e. that focusing on qualitative analysis allows for the flexibility of ignoring quantitative assumptions.  However frequently the qualitative devil is in the quantitative details.

One field where qualitative results are prevalent, and therefore the devil strikes frequently, is network theory. The emphasis on searching for “universal phenomena”, i.e. qualitative results applicable to networks arising in different contexts, arguably originates with Milgram’s small world experiment that led to the concept of “six-degree of separation” and Watts and Strogatz’s theory of collective dynamics in small-world networks (my friend Peter Dodds replicated Milgram’s original experiment using email in “An experimental study of search in global social networks“, Science 301 (2003), p 827–829) . In mathematics these ideas have been popularized via the Erdös number which is the distance between an author and Paul Erdös in a graph where two individuals are connected by an edge if they have published together. My Erdös number is 2, a fact that is of interest only in that it divulges my combinatorics roots. I’m prouder of other connections to researchers that write excellent papers on topics of current interest. For example, I’m pleased to be distance 2 away from Carl Bergstrom via my former undergraduate student Frazer Meacham (currently one of Carl’s Ph.D. students) and the papers:

1. Meacham, Frazer, Dario Boffelli, Joseph Dhahbi, David IK Martin, Meromit Singer, and Lior Pachter. “Identification and Correction of Systematic Error in High-throughput Sequence Data.” BMC Bioinformatics 12, no. 1 (November 21, 2011): 451. doi:10.1186/1471-2105-12-451.
2. Meacham, Frazer, Aaron Perlmutter, and Carl T. Bergstrom. “Honest Signaling with Costly Gambles.” Journal of the Royal Society Interface 10, no. 87 (October 6, 2013):20130469. dpi:10.1098/rsif.2013.0469.
One of Bergstrom’s insightful papers where he exposes the devil (in the quantitative details) is “Nodal Dynamics, Not Degree Distributions, Determine the Structural Controllability of Complex Networks”  by Cowan et al., PLoS One 7 (2012), e88398. It describes a not-so-subtle example of an unreasonable quantitative assumption that leads to intuition about network structural controllability that is, to be blunt, false. The example Carl critiques is from the paper

Controllability of complex networks” by Yang-Yu Liu, Jean-Jacques Slotine and Albert-László Barabási, Nature 473 (2011), p 167–173. The mathematics is straightforward: It concerns the dynamics of linear systems of the form

$\frac{d{\bf x}(t)}{dt} =-p{\bf x}(t) + A{\bf x}(t) + B{\bf u}(t)$.

The dynamics can be viewed as taking place on a graph whose adjacency matrix is given by the non-zero entries of A (an nxn matrix). The vector -p (of size nis called the pole of the linear system and describes intrinsic dynamics at the nodes. The vector (of size mcorresponds to external inputs that are coupled to the system via the nxm matrix B.

An immediate observation is that the vector p is unnecessary and can be incorporated into the diagonal of the matrix A. An element on the diagonal of A that is then non-zero can be considered to be a self-loop. The system then becomes

$\frac{d{\bf x}(t)}{dt} =A{\bf x}(t) + B{\bf u}(t)$

which is the form considered in the Liu et al. paper (their equation(1)). The system is controllable if there are time-dependent u that can drive the system from any initial state to a  target end state. Mathematically, this is equivalent to asking whether the matrix $C=(B,AB,A^2B,\ldots, A^{n-1}B)$ has full rank (a classic result known as Kalman’s criteria of controllability). Structural controllability is a weaker requirement, in which the question is  whether given only adjacency matrices and B, there exists weights for edges so that the weighted adjacency matrices satisfy Kalman’s criteria. The point of structural controllability is a theorem showing that structurally controllable systems are generically controllable.

The Liu et al. paper makes two points: the first is that if M is the size of a maximum matching in a given nxn adjacency matrix A, then the minimum m for which there exists a matrix B of size nxm for which the system is structurally controllable is m=n-M+1 (turns out this first point had already been made, namely in a paper by Commault et al. from 2002). The second point is that m is related to the degree distribution of the graph A.

The point of the Cowan et al. paper is to explain that the results of Liu et al. are completely uninteresting if $a_{ii}$  is non-zero for every i. This is because M is then equal to n (the matching matching of A consists of every self-loop). And therefore the result of Liu et al. reduces to the statement that m=1, or equivalently, that structural controllability for real-world networks can also be achieved with a single control input.

Unfortunately for Liu and coauthors, barring pathological canceling out of intrinsic dynamics with self-regulation, the diagonal elements of A, $a_{ii}$ will be zero only if there are no intrinsic dynamics to the system (equivalently $p_i=0$ or the time constants $\frac{1}{p_i}$ are infinite).  I repeat the obvious by quoting from Cowan et al.:

“However, infinite time constants at each node do not generally reflect the dynamics of the physical and biological systems in Table 1 [of Liu et al.]. Reproduction and mortality schedules imply species- specific time constants in trophic networks. Molecular products spontaneously degrade at different rates in protein interaction networks and gene regulatory networks. Absent synaptic input, neuronal activity returns to baseline at cell-specific rates. Indeed, most if not all systems in physics, biology, chemistry, ecology, and engineering will have a linearization with a finite time constant. ”

Cowan et al. go a bit further than simply refuting the premise and results of Liu et al. They avoid the naïve reduction of a system with intrinsic dynamics to one with self-loops, and provide a specific criterion for the number of nodes in the graph that must be controlled.

In summary, just as with the rounding of decimal digits, a (simple looking) assumption of Liu et al., namely that $p=0$ completely changes the qualitative nature of the result. Moreover, it renders false the thesis of the Liu et al. paper, namely that degree distributions in (real) networks affect the requirements for controllability.

Oops.

I recently came across a wonderful website of Karl Broman via his homepage.  He hosts a list of top ten worst graphs in which he describes examples of badly constructed/displayed graphs from statistical and computational biology papers. He calls  out mistakes of others, not in order to humiliate the authors, but rather to educate via example.  Its a list every computational biologist should peruse and learn from. The page also has an excellent list of recommended references, including Tufte’s “The visual display of quantitative information” which inspired the Minard plots for transcript abundance shown in Figure 2b and Appendix B of the supplement of my Cufflinks paper.

Interestingly, in addition to the list of graphs, Karl includes a list of tables but with only a single entry (the problem with the table is discussed here). Why should a list of worst graphs have 10 items but a list of tables only one? This is a blatant example of protectionism of tables at the expense of the oft abused graphs. With this post I’d like to remedy this unfair situation and begin the process of assembling a list of top ten worst tables: the tabular hall of shame.

I begin by offering Supplementary Table S6 from the paper “Evidence for Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions” by Luke Ward and Manolis Kellis, Science 337 (2012) 1675–1678:

The table is in support of a main result of the paper, namely that in addition to the 5% of the human genome conserved across mammalian genomes, at least another 4% is subject to lineage-specific constraint. This result is based on adding up the estimates of constrained nucleotides from Table S6 (using column 10, the derived allele frequency measure). These estimates were calculated using a statistic that is computed as follows: for each one of ten bins determined according to estimated background selection strength, and for every ENCODE feature F, the average derived allele frequency value DF was rescaled to

$PUCF = \frac{DF-DCNDC}{DNCNE-DCNDC}$,

where DCNDC and DNCNE are the bin-specific average derived allele frequencies of conserved non-degenerate coding regions and non-conserved non-ENCODE regions respectively.

One problem with the statistic is that the non-conserved regions contain nucleotides not conserved in all mammals, which is not the same as nucleotides not conserved in any mammals. The latter would be needed in order to identify human specific constraint. Second, the statistic PUCF is used as a proxy for the proportion under constraint even though, as defined, it could be less than zero or greater than one. Indeed, in Table S6 there are four values among the confidence intervals for the estimated proportions using derived allele frequency that include values less than 0% or above 100%, for example:

Ward and Kellis are therefore proposing that some features might have a negative number of nucleotides under constraint. Moreover, while it is possible that PUCF might correlate with the true proportion of nucleotides under constraint, there is no argument provided in the paper. Thus, while Ward and Kellis claim to have estimated the proportion of nucleotides under constraint, they have only computed a statistic named “proportion under constraint”.

The bottom line is that a table containing percentages should not have negative entries, and if it does reader beware!

On his worst graphs website Karl provides one example of a what he calls “a really bad table” and here I have offered a second (amazingly Table S5 from the Ward-Kellis paper is also a candidate for the list– Nicolas Bray and I review its sophistry in an arXiv note–  but I think every paper should be represented only once on the list). I ask you, the reader, to help me round out the list by submitting more examples in the comments or by email directly to me. Tables from my papers are fair game (notably example #10 of a bad graph on Karl’s list is from one of his own papers). Please help!

I recently completed a term as director of our Center for Computational Biology and one of the things I am proud of having accomplished is helping Brian McClendon, program administrator for the Center, launch a new Ph.D. program in computational biology at UC Berkeley.

The Ph.D. program includes a requirement for taking a new course, “Classics in Computational Biology” (CMPBIO 201), that introduces students to research projects and approaches in computational biology via a survey of classic papers. I taught the class last week and the  classic paper I chose was  “Comparison of biosequences” by Temple Smith and Michael Waterman, Advances in Applied Mathematics 2 (1981), p 482–489.  I gave a preparatory lecture this past week in which I discussed the Needleman-Wunsch algorithm, followed by leading a class discussion about the Smith-Waterman paper and related topics. This post is an approximate transcription of my lecture on the Needleman-Wunsch algorithm:

The Needleman-Wunsch algorithm was published in “A general method applicable to the search for similarities in the amino acid sequence of two proteins“, Needleman and Wunsch, Journal of Molecular Biology 48 (1970), p 443–453. As with neighbor-joining or UniFrac that I just discussed in my previous blog post, the Needleman-Wunsch algorithm was published without an explanation of its meaning, or even a precise description of the dynamic programming procedure it is based on. In their paper 11 years later, Smith and Waterman write

“In 1970 Needleman and Wunsch [1] introduced their homology (similarity) algorithm. From a mathematical viewpoint, their work lacks rigor and clarity. But their algorithm has become widely used by the biological community for sequence comparisons.”

They go on to state that “The statement in Needleman-Wunsch [1] is less general, unclearly stated, and does not have a proof” (in comparison to their own main Theorem 1). The Theorem Smith and Waterman are referring to explicitly describes  the dynamic programming recursion only implicit in Needleman-Wunsch, and then explains why it produces the “optimal” alignment. In hindsight, the Needleman-Wunsch algorithm can be described formally as follows:

Given two sequences and b of lengths $n_1$ and $n_2$ respectively, an alignment $A(a,b)$ is a sequence of pairs:

$[(a_{i_1},b_{j_1}),(a_{i_2},b_{j_2}),\ldots : 1 \leq i_1 < i_2 < \cdots \leq n_1, 1 \leq j_1 < j_2 \cdots \leq n_2 ]$.

Given an alignment $A(a,b)$, we let

$M = | \{ k: a_{i_k} = b_{j_k} \} |, X = | \{ k: a_{i_k} \neq b_{j_k} \} |$

and

$S = | \{ r:a_r \neq i_k \, \mbox{for any k} \} | + | \{ r:b_r \neq j_k \, \mbox{for any k} \} |$.

In other words, M counts the number of matching characters in the alignment, X counts the number of mismatches and S the number of “spaces”, i.e. characters not aligned to another character in the other sequence. Since every character must be either matched, mismatched, or a not aligned, we have that

$2 M + 2 X + S = n_1+n_2$.

Given scores (parameters) consisting of real numbers $m,x,s$, the Needleman-Wunsch algorithm finds an alignment that maximizes the function $m \cdot M + x \cdot X + s \cdot S$. Note that by the linear relationship between M,X and described above, there are really only two free parameters, and without loss of generality we can assume they are and x. Furthermore, only and are relevant for computing the score of the alignment; when the scores are provided statistical meaning, one can say that and X are sufficient statistics. We call them the summary of the alignment.

The specific procedure of the Needleman-Wunsch algorithm is to compute a matrix S recursively:

$S(i,j) = \mbox{max} \{ S_{i-1,j} + s,S_{i,j-1} + s, S_{i-1,j-1} + xI(a_{i} = b_{j} )\}$

where $I(a_i = b_j)$ is the indicator function that is 1 if the characters are the same, and zero otherwise, and an initialization step consists of setting $S(i,0)=s \cdot i$ and $S(0,j) = s \cdot j$ for all i,j. There are numerous generalizations and improvements to the basic algorithm, both in extending the scoring function to include more parameters (although most generalizations retain the linearity assumption), and algorithms for trading off time for space improvements (e.g. divide-and-conquer can be used to improve the space requirements from $O(n_1 n_2)$ to $O(n_1 + n_2)$).

An important advance in understanding the Needleman-Wunsch algorithm was the realization that the “scores” m,x and s can be interpreted as logarithms of probabilities if the sign on the parameters is the same, or as log-odds ratios if the signs are mixed. This is discussed in detail in the book “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Sequences” by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison.

One insight from the statistical point of view is that if  max and plus in the recursion are replaced with plus and times, and the logarithms of probabilities are replaced by the probabilities themselves, then the Needleman-Wunsch recursion computes the total probability of the pairs of sequences marginalized over alignments (i.e., it is the forward algorithm for a suitably defined hidden Markov model). Together with the backward algorithm, which consists of running the forward algorithm on the reversed sequences, this provides an efficient way to compute for every pair of nucleotides the posterior probability that they are aligned. In other words, there is a meaningful interpretation and useful application for the Needleman-Wunsch algorithm with the semi-ring $(\mathbb{R}_{-}, \oplus := \mbox{max}, \otimes := +)$ replaced with $(\mathbb{R}, \oplus := +, \otimes := \times)$.

In work in collaboration with Bernd Sturmfels, we realized that it also makes sense to run the Needleman-Wunsch algorithm with the  polytope algebra $(\mathcal{P}, \oplus := \mbox{convex hull of union}, \otimes := \mbox{Minkowski sum})$. Here $\mathcal{P}$ is  the set of convex polytopes in $\mathbb{R}^d$ for some d, polytopes are “added” using the geometric operation of the convex hull of the union of polytopes, and they are “multiplied” by taking Minkowski sum. The result allows one to find not only a single optimal alignment, but all optimal alignments.  The details are discussed in a pair of papers:

L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16138–16143.

C. Dewey, P. Huggins, K, Woods, B. Sturmfels and L. Pachter, Parametric alignment of Drosophila genomes, PLoS Computational Biology, Volume 2, Number 6 (2006) p e73.

The first explains the mathematical foundations for using the result and is where we coined the term polytope propagation algorithm for the Needleman-Wunsch algorithm with the polytope algebra. The second paper illustrates an application to a whole-genome alignment of two Drosophilids and provides an alternative to polytope propagation (the beneath-beyond algorithm) that is much faster. The type of analysis performed in the paper is also called parametric alignment, a problem with a long history whose popularity was launched by Dan Gusfield who provided the first software program for parametric alignment called XPARAL.

The polytope propagation algorithm (Needleman-Wunsch with the polytope algebra) is illustrated in the figure below:

The convex polygon in the bottom right panel contains in its interior points corresponding to all possible summaries for the alignments of the two sequences. The optimal summaries are vertices of the polygon. For each vertex, the parameters for which the Needleman-Wunsch algorithm will produce a summary corresponding to that vertex form a cone. The cones together form a fan which is shown inside the polygon.

The running time of polytope propagation depends on the number of vertices in the final polytope. In our papers we provide bounds showing that polytope propagation is extremely efficient. In particular, for parameters, the number of vertices is at most $O(n^{\frac{d(d-1)}{(d+1)}})$Cynthia Vinzant, formerly in our math department at UC Berkeley, has written a nice paper on lower bounds.

The fact that the Needleman-Wunsch algorithm can be run with three semi-rings means that it is particularly well suited to C++ thanks to templates that allow for the variables in the recursion to be abstracted. This idea (and code) is thanks to my former student Colin Dewey:

template<typename SemiRing>

void
alignGlobalLastRow(const string& seq1,
const string& seq2,
const typename SemiRing::Element match,
const typename SemiRing::Element mismatch,
const typename SemiRing::Element gap,
vector& row);
const Element one = SemiRing::multiplicativeIdentity;
// Initialize row
row.resize(seq2.size() + 1);
row[0] = one;
// Calculate first row
for (size_t j = 1; j <= seq2.size(); ++j)
row[j] = gap * row[j - 1];
// Calculate remaining rows
Element up, diag;
for (size_t i = 1; i <= seq1.size(); ++i) {
diag = row[0];
row[0] *= gap;
for (size_t j = 1; j <= seq2.size(); ++j) {
up = row[j];
if (seq1[i - 1] == seq2[j - 1]) {
row[j] = match * diag + gap * (up + row[j - 1]);
} else {
row[j] = mismatch * diag + gap * (up + row[j - 1]);
}
diag = up;
}
}


To recap, the three semi-rings with which it makes sense to run this algorithm are:

1. $(\mathbb{R}, \oplus := +, \otimes := \times)$
2. $(\mathbb{R}_{-}, \oplus := \mbox{max}, \otimes := +)$
3. $(\mathcal{P}, \oplus := \mbox{convex hull of union}, \otimes := \mbox{Minkowski sum})$

The interpretation of the output of the algorithm with the three semi-rings is:

1. $S_{ij}$ is the score (probability) of a maximum alignment between the ith prefix of one sequence and the jth prefix of the other.
2. $S_{ij}$ is the total score (probability) of all alignments between the ith prefix of one sequence and the jth prefix of the other.
3. $S_{ij}$ is a polytope whose faces correspond to summaries that are optimal for some set of parameters.

The semi-rings in (2) and (3) lead to particularly interesting and useful applications of the Needleman-Wunsch algorithm. An example of the use of (2), is in posterior decoding based alignment where nucleotides are aligned as to maximize the sum of the posterior probabilities computed by (2). In the paper “Alignment Metric Accuracy” (with Ariel Schwartz and Gene Myers) we  provide an interpretation in terms of an interesting metric on sequence alignments and in

A.S. Schwartz and L. Pachter, Multiple alignment by sequence annealing, Bioinformatics 23 (2007), e24–e29

we show how to adapt posterior decoding into a (greedy) multiple alignment algorithm. The video below shows how it works, with each step consisting of aligning a single pair of nucleotides from some pair of sequences using posterior probabilities computed using the Needleman-Wunsch algorithm with semi-ring (2):

We later extended this idea in work together with my former student Robert Bradley in what became FSA published in the paper “Fast Statistical Alignment“, R. Bradley et al., PLoS Computational Biology 5 (2009), e1000392.

An interesting example of the application of Needleman-Wunsch with semi-ring (3), i.e. polytope propagation, is provided in the recent paper “The RNA Newton polytope and learnability of energy parameters” by Elmirasadat Forouzmand and Hamidreza Chitsaz, Bioinformatics 29 (2013), i300–i307. Using our polytope propagation algorithm, they investigate whether simple models for RNA folding can produce, for some set of parameters, the folds of sequences with solved structures (the answer is sometimes, but not always).

It is remarkable that forty three years after the publication of the Needleman-Wunsch paper, and thirty two years after the publication of the Smith-Waterman paper, the algorithms remain pertinent, widely used, and continue to reveal new secrets. True classics.

Where are the authors now?

Saul B. Needleman writes about numismatics.

Christian Wunsch is a clinical pathologist in the University of Miami health system.

Temple Smith is Professor Emeritus of Biomedical Engineering at Boston University.

Michael Waterman is Professor of Biological Sciences, Mathematics and Computer Science at the University of Southern California.

“Basically, I’m not interested in doing research and I never have been. I’m interested in understanding, which is quite a different thing. And often to understand something you have to work it out yourself because no one else has done it.” – David Blackwell

The importance of understanding is unfortunately frequently downplayed in computational biology, where methods are the slums of Results, and are at best afforded a few sentences in supplementary materials of research articles.

Nevertheless, there are a handful of examples to the contrary, an important one being the paper “Neighbor-joining revealed” by Olivier Gascuel and Mike Steel, Molecular Biology and Evolution 23 (2006), p 1997–2000, which I have valued for many years and inspired me to write this post about a similar type of paper. In the neighbor-joining case, Gascuel and Steel review a series of results showing that neighbor-joining, first published in the form of an ad-hoc recipe with only metaphorical motivation, is actually a greedy algorithm for optimizing a least squares loss function (“Theoretical foundations of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting“, R Desper and O Gascuel, Molecular Biology and Evolution 21 (2004), p 587–598). The particular form of the minimization criterion emerges from a formula of Pauplin  (“Direct calculation of a tree length using a distance matrix“, Y Paulin, Journal of Molecular Evolution 10 (1993), p 1073–1095) that in turn has deeper roots. This is a story me and my students have played a small role in, providing some insight into the robustness of the neighbor-joining algorithm in “Why neighbor joining works” and an explanation for the surprisingly simple Pauplin’s formula in “Combinatorics of least squares trees“. These series of insights into neighbor-joining turn out to have practical significance. Results in the “Combinatorics of least squares trees” paper were recently extended by Fabio Pardi and Olivier Gascuel in “Combinatorics of distance-based tree inference“, where they synthesize all of the previous results on neighbor-joining into a blueprint for improving distance based methods. The details of this story will be the subject of a future blog post; here I discuss another beautiful example of work that involves understanding as opposed to research:

I have a long-standing interest in the computational problems that arise in metagenomics, starting with work I did with my former student Kevin Chen on “Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities“.  Recently, while drawing parallels between RNA-Seq and metagenomics (with Lorian Schaeffer and Kevin McLoughlin), I was thinking again about problems encountered in the paper “Disordered microbial communities in asthmatic airways” in which I played a minor role assisting William Cookson. Via a long chain of citation chasing, I eventually stumbled upon the elegant article “The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples” by Steven Evans and Erick Matsen, Journal of the Royal Statistical Society 74 (2012), p 569–592, in which the authors explain the meaning of the (weighted) UniFrac distance for metagenomic samples. Read the rest of this entry »

Don’t believe the anti-hype. They are saying that RNA-Seq promises the discovery of new expression events, but it doesn’t deliver:

Is this true? There have been a few papers comparing microarrays to RNA-Seq technology (including one of my own) that I’ll discuss below, but first a break-down of the Affymetrix “evidence”. The first is this figure (the poor quality of the images is exactly what Affymetrix provides, and not due to a reduction in quality on this site; they are slightly enlarged when clicked on):

The content of this figure is an illustration of the gene LMNB1 (Lamin protein of type B), used to argue that microarrays can provide transcript level resolution whereas RNA-Seq can’t!! Really? Affymetrix is saying that RNA-Seq users would likely use the RefSeq annotation which only has three isoforms. But this is a ridiculous claim. It is well known that RefSeq is a conservative annotation and certainly RNA-Seq users have the same access to the multiple databases Affymetrix used to build their annotation (presumably, e.g. Ensembl). It therefore seems that what Affymetrix is saying with this figure is that RNA-Seq users are dumb.

The next figure is showing the variability in abundance estimates as a function of expression level for RNA-SEq and the HTA 2.0, with the intended message being that microarrays are less noisy:

But there is a subtle trick going on here. And its in the units. The x-axis is showing RPM, which is an abbreviation for Reads Per Million. This is not a commonly used unit, and there is a reason. First, its helpful to review what is used. In his landmark paper on RNA-Seq, Ali Mortazavi introduced the units RPKM (note the extra K) that stands for reads per kilobase of transcript per million mapped. Why the extra kilobase term? In my review on RNA-Seq quantification I explain that RPKM is proportional to a maximum likelihood estimate of transcript abundance (obtained from a simple RNA-Seq model). The complete derivation is on page 6 ending in Equation 13; I include a summary here:

The maximum likelihood (ML) abundances $\hat{\rho}_t$ are  given by

$\hat{\rho}_t = \frac{\frac{\hat{\alpha}_t}{l_t}}{\sum_{r \in T} \frac{\hat{\alpha}_r}{l_r}} \propto \frac{X_t}{\left( \frac{l_t}{10^3}\right) \left( \frac{N}{10^6}\right) }$

In these equations $l_t$ is the length of transcript (if reads are long it is necessary to modify the length due to edge effects, hence the tilde in the paper), the $\hat{\alpha}_t$ are the maximum likelihood estimates for the probabilities of selecting reads from transcripts (unnormalized by their length) and finally $X_t$ is the number of reads mapping to transcript t while N is the total number of mapped reads. The point is that RPKM (the rightmost formula for abundance) is really a unit for describing the maximum likelihood relative abundances ($\hat{\rho}$) scaled by some factors.

RPKM as a unit has two problems. The first is that in current RNA-Seq experiments reads are paired so that the actual units being counted (in $X_t$) are fragments. For this reason we modified RPKM to FPKM in the Cufflinks paper (the “F” replaced “R” for fragment instead of read). A more serious problem, noted by Bo Li and Colin Dewey in their paper on RSEM, is that while FPKM is proportional to ML estimates of abundance, the proportionality constants may vary between experiments. For this reason they proposed TPM (transcripts per million) which is also proportional to the ML abundance estimates but with a proportionality constant (a million) that is the same between experiments. All of these units are used simply to avoid writing down the $\hat{\rho}_t$ which are in many cases tiny numbers since they must all sum to 1.

Returning to the Affymetrix figure, we see the strange RPM units. In essence, this is the rightmost term in the equation above, with the $l_t$ length terms removed from the denominators. Therefore RPM is proportional to the $\hat{\alpha}_t$. If a transcript is short, even if it is equally abundant to a longer transcript ,it will produce less RNA-Seq reads and therefore its $\hat{\alpha}_t$ will be (possibly considerably) smaller. The effect of displaying RPM for RNA-Seq vs. expression level for the HTA 2.0 arrays is therefore to mix apples and oranges. Since what is being displayed is a coefficient of variation, there is a bias caused by the relationship between length and expression (e.g. many highly expressed housekeeping genes are short).

To be fair to Affymetrix the conversion between the $\hat{\alpha}$ and the $\hat{\rho}$ can be confusing (its explained in Lemma 14 in the Supplement of the Cufflinks paper). So maybe the discordant x-axes were unintentional…but then there is the third figure:

Here its a bit hard to tell what is going on because not all the information needed to decipher the figure is provided. For example, its not clear how the “expression of exons” was computed or measured for the RNA-Seq experiment. I suspect that as with the previous figure, read numbers were not normalized by length of exons, and moreover spliced reads (and other possibly informative reads from transcripts) were ignored. In other words, I don’t really believe the result.

Having said this, it is true that expression arrays can have an advantage in measuring exon expression, because an array measurement is absolute (as opposed to the relative quantification that is all that is possible with RNA-Seq). Array signal is based on hybridization, and it is probably a reasonable assumption that some minimum amount of RNA triggers a signal, and that this amount is independent of the remainder of the RNA in an experiment. So arrays can (and in many cases probably do) have advantages over RNA-Seq.

There are a few papers that have looked into this, for example the paper “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae ” by Nookaew et al., Nucleic Acids Research 40 (2012) who find high reproducibility in RNA-Seq and consistency between arrays and RNA-Seq.  Xu et al., in “Human transcriptome array for high-throughput clinical studies“, PNAS 108 (2011), 3707–3712 are more critical, agreeing with Affymetrix that arrays are more sensitive at the exon level. For disease studies, they recommend using RNA-Seq to identify transcripts relevant to the disease, and then screening for those transcripts on patients using arrays.

For the Cuffdiff2 paper describing our new statistical procedures for differential analysis of transcripts and genes, the Rinn lab performed deep RNA-Seq and array expression measurement on the same samples from a HOXA1 knowdown (the experiments included multiple replicates of both the RNA-Seq and the arrays). To my knowledge, it is the deepest and most comprehensive data currently available for comparing arrays and RNA-Seq. Admittedly, the arrays used were not Affymetrix but Agilent SurePrint G3, and the RNA-Seq coverage was deep, however we had two main findings very different from the Affymetrix claims. First, we found overall strong correlation between array expression values and RNA-Seq abundance estimates. The correlation remained strong for large regimes of expression even with very few reads (tested by sequencing fewer reads from a MiSeq). Second, we found that arrays were missing differentially expressed transcripts, especially at low abundance levels. In other words, we found RNA-Seq to have higher resolution. The following figure from our paper made the case (note the overall Spearman Correlation was 0.86):

There are definitely continued applications for arrays. Both in high-throughput screening applications (as suggested in the Xu et al. paper), and also in the development of novel assays. For example Mercer et al. “Targeted rNA sequencing reveals the deep complexity of the human transcriptome“, Nature Biotechnology 30 (2012) 99–104  show how to couple capture (with arrays) with RNA-Seq to provide ultra deep sequencing in subsets of the transcriptome. So its not yet the time to write off arrays. But RNA-Seq has many applications of its own. For example the ability to better detect allele-specific expression, the opportunity to identify RNA-DNA differences (and thereby study RNA editing), and the ability to study expression in non-model organisms where genomes sequences are incomplete and annotations poor. For all these reasons I’m betting on RNA-Seq.

I  learned today about a cute Fibonacci fact from my friend Federico Ardila:

1/89 = 0.011235
1/9899 = 0.0001010203050813213455
1/998999 = 0.000001001002003005008013021034055089144233377610...

$\vdots$

The pattern is explained in a note in the College Mathematics Journal (2003). Of course Fibonacci numbers are ubiquitous in biology, but in thinking about this pattern I was reminded of a lesser known connection between Fibonacci numbers and biology in the context of the combinatorics of splicing:

A stretch of DNA sequence with $n$ acceptor sites and $m$ donor sites can produce at most $F_{n+m+1}$ distinct spliced transcripts, where the numbers $F_i$ are the Fibonacci numbers.

The derivation is straightforward using induction: to simplify notation we denote acceptor sites with an open parenthesis “(” and donor sites with a closed parenthesis “)”. We  use the notation $|S|$ for the length of a string S of open and closed parentheses, and denote the maximum number of transcripts that can be spliced from S by $p(S)$. We assume that the theorem is true by induction for $|S| \leq n-1$ (the base case is trivial). Let S be a string with $|S|=n$. Observe that S must have an open parenthesis somewhere that is followed immediately be a closed parentheses. Otherwise we have that $p(S)=1$ (the empty string is considered to be a valid transcript). We therefore have $S= S_1()S_2$ where $S_1$ has open and r closed parentheses respectively, and $S_2$ has $n-k-r-s-2$ open and s closed parentheses respectively. Now notice that

$p(S) \leq F_{r+k+1}F_{n-k-r}+F_{r+k+2}F_{n-k-r-1}+F_{n-1}-F_{r+k+1}F_{n-k-r-1}$.

This can be seen by breaking down the terms as follows: One can take any transcript in $S_1$ and append to it a transcript in “$(S_2$“. Similarly, one can take a transcript in $S_1)$ and append to it a transcript in $S_2$. Transcripts ommitting the interior pair () between $S_1$ and $S_2$ are counted twice, which is fine because one of the copies corresponds to all transcripts that include the interior pair () between $S_1$ and $S_2$. The last two terms account for all transcripts whose last element in $S_1$ is an open parenthesis, and whose first element in $S_2$ is a closed parenthesis. This is counted by considering all transcripts in $S_1S_2$ and subtracting transcripts that do not include a parentheses from each. Finally, using the Fibonacci recurrence and the identity

$F_{n+m} = F_{n+1}F_m + F_nF_{m-1}$

we have

$p(S) \leq F_n + F_{r+k+1}F_{n-k-r-1}+F_{n-1}-F_{r+k+1}F_{n-k-r-1} = F_{n+1}$.

The bound is attained for certain configurations, such as $S = ()() \cdots ()$ with acceptor and donor sites.

The combinatorics is elementary and it only establishes what is already intuitive and obvious: splicing combinatorics dictates that there are a lot of transcripts (exponentially many in the number of acceptor and donor sites) that can, in principle, be spliced together, even from short DNA sequences. The question, then, is why do most genes have so few isoforms? Or do they?