You are currently browsing the category archive for the ‘q-bio.GN’ category.

“An entertaining freshness… Tic Tac!” This is Ferrero‘s tag line for its most successful product, the ubiquitous Tic Tac. And the line has stuck. As WikiHow points out in how to make your breath fresh: *first* buy some mints,* then *brush your teeth.

One of the amazing things about Tic Tacs is that they are sugar free. Well… almost not. As the label explains, a single serving (one single Tic Tac) contains 0g of sugar (to be precise, less than 0.5g, as explained in a footnote). In what could initially be assumed to be a mere coincidence, the weight of a single serving is 0.49g. It did not escape my attention that 0.50-0.49=0.01. Why?

To understand it helps to look at the labeling rules of the FDA. I’ve reproduced the relevant section (Title 21) below, and boldfaced the relevant parts:

TITLE 21–FOOD AND DRUGS

CHAPTER I–FOOD AND DRUG ADMINISTRATION

DEPARTMENT OF HEALTH AND HUMAN SERVICES

SUBCHAPTER B–FOOD FOR HUMAN CONSUMPTION (c)

Sugar content claims–(1)Use of terms such as “Consumers may reasonably be expected to regard terms that represent that the food contains no sugars or sweeteners e.g., “sugar free,” or “no sugar,” as indicating a product which is low in calories or significantly reduced in calories. Consequently, except as provided in paragraph (c)(2) of this section,sugar free,” “free of sugar,” “no sugar,” “zero sugar,” “without sugar,” “sugarless,” “trivial source of sugar,” “negligible source of sugar,” or “dietarily insignificant source of sugar.”a food may not be labeled with such terms unless:(i)

The food contains less than 0.5 g of sugars, as defined in 101.9(c)(6)(ii), per reference amount customarily consumed and per labeled serving or, in the case of a meal product or main dish product, less than 0.5 g of sugars per labeled serving;and(ii) The food contains no ingredient that is a sugar or that is generally understood by consumers to contain sugars

unless the listing of the ingredient in the ingredient statement is followed by an asterisk that refers to the statement below the list of ingredients, which states “adds a trivial amount of sugar,” “adds a negligible amount of sugar,” or “adds a dietarily insignificant amount of sugar;” and(iii)(A) It is labeled “low calorie” or “reduced calorie” or bears a relative claim of special dietary usefulness labeled in compliance with paragraphs (b)(2), (b)(3), (b)(4), or (b)(5) of this section, or, if a dietary supplement, it meets the definition in paragraph (b)(2) of this section for “low calorie” but is prohibited by 101.13(b)(5) and 101.60(a)(4) from bearing the claim; or

(B) Such term is immediately accompanied, each time it is used, by either the statement “not a reduced calorie food,” “not a low calorie food,” or “not for weight control.”

It turns out that **Tic Tacs are in fact almost pure sugar**. Its easy to figure this out by looking at the number of calories per serving (1.9) and multiplying the number of calories per gram of sugar (3.8) by 0.49 => 1.862 calories. 98% sugar! Ferrero basically admits this in their FAQ. Acting completely within the bounds of the law, they have simply exploited an **arbitrary threshold of the FDA. **Arbitrary thresholds are always problematic; not only can they have unintended consequences, but they can be manipulated to engineer desired outcomes. In computational biology they have become ubiquitous, frequently being described as “filters” or “pre-processing steps”. Regardless of how they are justified, thresholds are thresholds are thresholds. They can sometimes be beneficial, but they are dangerous when wielded indiscriminately.

There is one type of thresholding/filtering in used in RNA-Seq that my postdoc Bo Li and I have been thinking about a bit this year. It consists of removing duplicate reads, i.e. reads that map to the same position in a transcriptome. The motivation behind such filtering is to reduce or eliminate amplification bias, and it is based on the intuition that it is unlikely that lightning strikes the same spot multiple times. That is, it is improbable that many reads would map to the exact same location assuming a model for sequencing that posits selecting fragments from transcripts uniformly. The idea is also called de-duplication or digital normalization.

Digital normalization is obviously problematic for high abundance transcripts. Consider, for example, a transcripts that is so abundant that it is extremely likely that at least one read will start at *every* site (ignoring the ends, which for the purposes of the thought experiment are not relevant). This would also be the case if the transcript was twice as abundant, and so digital normalization would prevent the possibility for estimating the difference. This issue was noted in a paper published earlier this year by Zhou *et al. * The authors investigate in some detail the implications of this problem, and quantify the bias it introduces in a number of data sets. But a key question not answered in the paper is what does digital normalization actually do?

To answer the question, it is helpful to consider how one might estimate the abundance of a transcript after digital normalization. One naive approach is to just count the number of reads after de-duplication, followed by normalization for the length of the transcript and the number of reads sequenced. Specifically if there are *n *sites where a read might start, and *k *of the sites had at least one read, then the naive approach would be to use the estimate suitably normalized for the total number of reads in the experiment. This is exactly what is done in standard de-duplication pipelines, or in digital normalization as described in the preprint by Brown *et al.* However assuming a simple model for sequencing, namely that every read is selected by first choosing a transcript according to a multinomial distribution and then choosing a location on it uniformly at random from among the sites, a different formula emerges.

Let *X *be a random variable that denotes the number of sites on a transcript of length *n *that are covered in a random sequencing experiment, where the number of reads starting at each site of the transcript is Poisson distributed with parameter *c *(i.e., the average coverage of the transcript is *c*). Note that

.

The maximum likelihood estimate for *c *can also be obtained by the method of moments, which is to set

from which it is easy to see that

.

This is the same as the (derivation of the) Jukes-Cantor correction in phylogenetics where the method of moments equation is replaced by yielding , but I’ll leave an extended discussion of the Jukes-Cantor model and correction for a future post.

The point here, as noticed by Bo Li, is that since by Taylor approximation, it follows that the average coverage can be estimated by . This is exactly the naive estimate of de-duplication or digital normalization, and the fact that as means that blows up, at high coverage hence the results of Zhou *et al.*

Digital normalization as proposed by Brown *et al.* involves possibly thresholding at more than one read per site (for example choosing a threshold *C* and removing all but at most *C* reads at every site). But even this modified heuristic fails to adequately relate to a probabilistic model of sequencing. One interesting and easy exercise is to consider the second or higher order Taylor approximations. But a more interesting approach to dealing with amplification bias is to avoid thresholding *per se*, and to instead identify outliers among duplicate reads and to adjust them according to an estimated distribution of coverage. This is the approach of Hashimoto *et al.* in a the paper “Universal count correction for high-throughput sequencing” published in March in PLoS One. There are undoubtedly other approaches as well, and in my opinion the issue will received renewed attention in the coming year as the removal of amplification biases in single-cell transcriptome experiments becomes a priority.

As mentioned above, digital normalization/de-duplication is just one of many thresholds applied in a typical RNA-Seq “pipeline”. To get a sense of the extent of thresholding, one need only scan the (supplementary?) methods section of any genomics paper. For example, the GEUVADIS RNA-Seq consortium describe their analysis pipeline as follows:

“We employed the JIP pipeline (T.G. & M.S., data not shown) to map mRNA-seq reads and to quantify mRNA transcripts. For alignment to the human reference genome sequence (GRCh37, autosomes + X + Y + M), we used the GEM mapping suite24 (v1.349 which corresponds to publicly available pre-release 2) to first map (max. mismatches = 4%, max. edit distance = 20%, min. decoded strata = 2 and strata after best = 1) and subsequently to split-map (max.mismatches = 4%, Gencode v12 and de novo junctions) all reads that did not map entirely. Both mapping steps are repeated for reads trimmed 20 nucleotides from their 3′-end, and then for reads trimmed 5 nucleotides from their 5′-end in addition to earlier 3′-trimming—each time considering exclusively reads that have not been mapped in earlier iterations. Finally, all read mappings were assessed with respect to the mate pair information: valid mapping pairs are formed up to a maximum insert size of 100,000 bp, extension trigger = 0.999 and minimum decoded strata = 1. The mapping pipeline and settings are described below and can also be found in https://github.com/gemtools, where the code as well as an example pipeline are hosted.”

This is not a bad pipeline- the paper shows it was carefully evaluated– and it may have been a practical approach to dealing with the large amount of RNA-Seq data in the project. But even the first and seemingly innocuous thresholding to trim low quality bases from the ends of reads is controversial and potentially problematic. In a careful analysis published earlier this year, Matthew MacManes looked carefully at the effect of trimming in RNA-Seq, and concluded that aggressive trimming of bases below Q20, a standard that is frequently employed in pipelines, is problematic. I think his Figure 3, which I’ve reproduced below, is very convincing:

It certainly appears that some mild trimming can be beneficial, but a threshold that is optimal (and more importantly *not detrimental) *depends on the specifics of the dataset and is difficult* *or impossible to determine *a priori.* MacManes’ view (for more see his blog post on the topic) is consistent with another paper by Del Fabbro *et al*. that while seemingly positive about trimming in the abstract, actually concludes that “In the specific case of RNA-Seq, the tradeoff between sensitivity (number of aligned reads) and specificity (number of correctly aligned reads) seems to be always detrimental when trimming the datasets (Figure S2); in such a case, the modern aligners, like Tophat, seem to be able to overcome low quality issues, therefore making trimming unnecessary.”

Alas, Tic Tac thresholds are everywhere. My advice is: brush your teeth *first*.

This is the third and final post in a series (part1, part2) of posts on two back-to-back papers published in Nature Biotechnology in August 2013:

- Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlations,
*Nature Biotechnology***31**(8), 2013, p 720–725. doi:10.1038/nbt.2601 - Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networks,
*Nature Biotechnology***31**(8), 2013, p 726–733. doi:10.1038/nbt.2635

**An inconvenient request**

One of the great things about conferences is that there is time to chat in person with distant friends and collaborators. Last July, at the ISMB conference in Berlin, I found myself doing just that during one of the coffee breaks. Suddenly, Manolis Kellis approached me and asked to talk in private. The reason for his interruption: **he came to request that I remove an arXiv post** of mine, namely “Comment on ‘Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions“, a response to a paper by Ward and Kellis. Why? He pointed out that my arXiv post was ranking highly on Google. This was inconvenient for him, he explained, while insisting that it was wrong of me to post a criticism of his work on a forum where he could not directly respond. He suggested that it would be best to work out any issues I might have with his paper offline. Unfortunately, there was the inconvenient truth that arXiv postings cannot be removed. Unlike some journals, where, say, a supplement can be revised while having the original removed (see the figure switching of Feizi *et al.*), arXiv preprints are permanent.

My initial confusion quickly turned to anger. After all, my arXiv comment had been rejected from Science where I had submitted it as a technical comment on the Ward-Kellis paper. I had then put it on the arXiv as a last resort measure to at least have some record of my concerns publicly accessible. How is this wrong? Can one not critique the work of Manolis Kellis?

**Network nonsense begins**

My first review of a Manolis Kellis paper was in the fall of 2006, in my capacity as a program committee member of the * Research in Computational Molecular Biology* (RECOMB) conference held in Oakland, CA in 2007. Because Oakland is right next to Berkeley, a number of Berkeley professors were involved in organizing and running the conference. Terry Speed was chair of the program committee. I was out of the country that year on sabbatical at the University of Oxford, so I could not participate, or even attend, the conference, but I volunteered to serve on the program committee. For those not familiar with the RECOMB review process, it is modeled after the standard Computer Science conferences. The program committee chair forms the program committee, who are then assigned a handful of papers to review and score. Reviews are entered on a website, and after a brief period of online discussion about borderline papers, scores are revised and accept/reject decisions are made. Authors can revise their manuscripts, but the reviewers never see the papers again before publication in the proceedings.

One of the papers I was assigned to review was by a student named Joshua Grochow and his advisor Manolis Kellis. The paper was titled “Network Motif Discovery Using Subgraph Enumeration and Symmetry-Breaking“. Although networks were not my research focus at the time, and “symmetry-breaking” evoked in me nightmares from the physics underworld, I agreed to the review. The paper seemed to contain some interesting algorithms, appeared to have a combinatorial flavor, and potentially important applications- a good mix for RECOMB.

The problem addressed by Grochow & Kellis was that of identifying “network motifs” in biological networks. “Motifs” can be defined in a variety of ways, and the Grochow-Kellis objective was simple. In graph theoretic terms, given a graph *G*, the goal was to find subgraphs occurring with high multiplicity to an extent unlikely in a random graph. There are many models for random graphs, and the one that the results in Grochow-Kellis are based on is the Erdös-Renyi model (each edge chosen independently with some fixed probability). The reason this definition might be of biological interest, is that recurrent motifs interspersed in a graph are likely to represent evolutionarily conserved interaction modules.

The paper begins with a description of the method. I won’t go into the details here, except to say that everything seemed well until I read the caption of Figure 3. There the number 27,720 caught my eye.

During my first few years of graduate school I took many courses on enumerative and algebraic combinatorics. There are some numbers that combinatorialists just “know”. For example, seeing 42 emerge as the answer to a counting problem does not bring to mind Douglas Adams, but rather the vast literature on Catalan numbers and their connections to dozens of well-known counting problems. Similarly, a number such as 126 brings to mind binomial coefficients (), and with them the idea of counting the number of subsets of fixed size from a larger set. When I saw the number 27,720 I had a hunch that somehow some canonical combinatorial set had been enumerated. The idea may have entered my mind because of the picture of the “motif” in which I immediately recognized a clique (all vertices mutually connected) and a stable set (no pair of vertices connected). In any case, I realized that

.

The significance of this is that the “motif” on the left-hand side of Figure 3 had appeared many times because of a type of double- or rather thousandfold- counting. Instead of representing statistically significant recurring independent motifs, this “motif” arises because of a combinatorial artifact. In the specific example of Figure 3, the motif occurred once for any choice of 4 nodes from the clique of size 9, and any choice of 3 nodes from the stable set of size 12.

**The point is that in a graph, a ny**

**subgraph attached to a large clique (or stable set) will occur many times.**This simple observation is a result of the fact that there are many subgraphs of a clique (or stable set) that are identical. I realized that this meant that the Grochow-Kellis method was essentially a heuristic for finding cliques and stable sets in graphs. The particular “network motifs” they were pulling out were just subgraphs that happened to be connected to some large cliques and stable sets. There are two problems with this: first, a clique or a stable set can hardly be considered an interesting “network motif”. Moreover, the fact that they appear in biological networks much more than in Erdös-Renyi random graphs is not surprising. Second, there is a large literature on finding cliques in graphs, none of which Grochow-Kellis cited or seemed to be familiar with.

The question of the performance of the Grochow-Kellis algorithm is answered in their Figure 3 as well. There is a slightly larger motif consisting of **6 **nodes from the stable set of size 12, instead of 3. That motif occurs in all subsets of the stable set instead of subsets which means that **there is a motif that occurs 116,424 times**! **Grochow and Kellis’s algorithm did not even achieve its stated goal**. It really ought to have outputted the left hand side figure with six nodes in the stable set on the left, and not three. **In other words, this was a paper providing uninteresting solutions from a biological point of view, and doing so poorly to boot.**

I wrote up a detailed report on the paper, and posted it on the RECOMB review website together with poor scores reflecting my opinion that the paper *had *to be rejected. How could RECOMB, ostensibly the premier computer science conference on computational and algorithmic biology, publish a paper with neither a computational nor biological result? Not to mention an algorithm that demonstratably did *not* find the most frequently occurring motif.

As you might already guess, my rejection was subsequently overruled. I don’t know who made the final decision to ** accept** the Grochow & Kellis paper to the RECOMB conference, although presumably it was the program committee chair. The decision jarred with my sense of scientific integrity. I had put considerable effort into reviewing the paper and understanding it, and I felt that I had provided a compelling objective argument for why the paper was fundamentally flawed- the fact that the results were trivial (and incorrect!) was not a subjective statement. At this point I need to point out that the RECOMB conference is quite difficult to get into. The acceptance rate for papers in 2007, consistent with other years, was 21.8%. I knew this meant that even a single very negative review, especially one with a compelling argument against the paper, almost certainly would lead to rejection of the paper. So I couldn’t understand then, nor do I still understand now, on what basis the decision was made to accept the paper. This bothered me greatly, and after much deliberation I started boycotting the conference. Despite publishing five RECOMB papers from 2000 to 2006 and regularly attending the meeting during that time, the continued poor decisions and haphazard standards for papers selected have led me to not return in almost 8 years.

Grochow and Kellis obviously received my review and considered how to “deal with it”. They added a section titled “The role of combinatorial effects”, in which they explained the origins of the number 27,720 that they gleaned from my report, but then spun the bad news they had received as “resulting from combinatorial connectivity patterns prevalent in larger network structures.” They then added that “…this combinatorial clustering effect brings into question the current definition of network motif” and proposed that “additional statistics…might well be suited to identify larger meaningful networks.” This is a lot like someone claiming to discover a bacteria whose DNA is arsenic-based and upon being told by others that the “discovery” is incorrect – in fact, that very bacteria seeks out phosphorous – responding that this is “really helpful” and that it “raises lots of new interesting open questions” about how arsenate gets into cells. Chutzpah. When you discover your work is flawed, the correct response is to retract it.

I don’t think people read papers very carefully. Joshua Grochow went on to win the MIT Charles and Jennifer Johnson Outstanding M. Eng. Thesis Award for his RECOMB work on network motif discovery. [Added February 18: Grochow and Kellis have posted a reply here].

**The nature of man**

I have to admit that after the Grochow-Kellis paper I was a bit skeptical of Kellis’ work. Not because of the paper itself (everyone makes mistakes), but because of the way he responded to my review. So a year and a half ago, when Manolis Kellis published a paper in an area I care about and am involved in, I may have had a negative prior. The paper was Luke Ward and Manolis Kellis “Evidence for Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions”, *Science *337 (2012) . Having been involved with the ENCODE pilot, where I contributed to the multiple alignment sub-project, I was curious what comparative genomics insights the full-scale $130 million dollar project revealed. The press releases accompanying the Ward-Kellis paper (e.g. The Nature of Man, The Economist) were suggesting that Ward and Kellis had figured out what makes a human a human; my curiosity was understandably piqued.

Ward and Kellis combined population genomic data from the 1000 Genomes Project with biochemical data from the ENCODE project to look for signatures of human constraint in regulatory elements. Their analysis was based on measuring three different proxies for constraint: SNP density, heterozygosity and derived allele frequency. To identify specific classes of regulatory regions under constraint, aggregated regions associated with specific gene ontology (GO) categories were tested for significance. Reading the paper I was amazed to discover they found precisely two categories: retinal cone cell development and nerve growth factor receptor signaling. It was only upon reading the supplement that I discovered that their tests had produced 53 other GO categories as well (Table S5).

Despite the fact that the listed categories were required to pass a false discovery rate (FDR) threshold for both the heterozygosity and derived allele frequency (DAF) measures, **it was statistically invalid for them to highlight any specific GO category. FDR control merely guarantees a low false discovery rate among the entries in the entire list**. Moreover, there was no obvious explanation for why categories such as chromatin binding (which had a smaller DAF than nerve growth) or protein binding (with the smallest p-value) appeared to be under purifying selection. As with the Feizi *et al. *paper, the supplement produced a story much less clean than the one presented in the main body of the paper. In fact, retinal cone cell development and nerve growth factor were **33 and 34** out of the 55 listed GO categories when sorted by the DAF p-value (42 and 54 when sorted by heterozygosity p-value). In other words, **the story being sold in the paper was based on blatant statistically invalid cherry picking.**

The other result of the paper was an estimate that in addition to the 5% of the human genome conserved across mammalian genomes, at least another 4% has been subject to lineage-specific constraint. This result was based on adding up the estimates of constrained nucleotides from their Table S6 (using the derived allele frequency measure). These were calculated using a statistic that was computed as follows: for each one of ten bins determined according to estimated background selection strength, and for every feature *F*, the average DAF value *D _{F}* was rescaled to

,

where *D _{CNDC}* and

*D*were the bin-specific average DAFs of conserved non-degenerate coding regions and non-conserved non-ENCODE regions respectively. One problem with the statistic is that the non-conserved regions contain nucleotides not conserved in all mammals, which is not the same as nucleotides not conserved in any mammals. The latter would have been needed in order to identify human specific constraint. Second, the statistic

_{NCNE}*PUC*was used as a proxy for the proportion under constraint even though, as defined, it could be less than zero or greater than one. Indeed, in Table S6 there were four values among the confidence intervals for the estimated proportions using DAF that included values less than 0% or above 100%:

_{F }**Ward and Kellis were therefore proposing that some features might have a negative number of nucleotides under constraint**. Moreover, while it is possible that after further rescaling *PUC _{F }*might have correlated with the true proportion of nucleotides under constraint, there was no argument provided in the paper.

**Thus, while Ward and Kellis**.

*claimed*to have estimated the proportion of nucleotides under constraint, they had only computed a statistic named “proportion under constraint”Nicolas Bray and I wrote up these points in a short technical comment and submitted it to the journal Science early in November 2012. The comment was summarily rejected with a curt reply by senior editor Laura Zahn stating that “relative to other Technical Comments we have recently received we feel that the scope and focus of your comment make it more suitable for the Online Comments facility at Science, rather than as a candidate for publication as a Technical Comment.” It is worth noting that Science did decide to publish another comment: Phil Green and Brent Ewing’s, “Comment on’Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions‘”, *Science *10 (2013). Green and Ewing’s comment is *biological *in nature. Their concern is that “… the polymorphism trends are primarily attributable to mutational variation and technical artifacts rather than selection.” Its fine that Science decided to host a debate on a biology question on its pages, but **how can one debate the interpretation of results from a method, when the method is fundamentally flawed to begin with? **After all, our problem with PUC was much deeper than a “technical flaw”.

We decided at the end to place the comment in the arXiv. After doing so, it became apparent that it had little impact. Indeed, I have never received any feedback about it from anyone. Apparently even this was too much for Manolis Kellis.

**Methods matter**

By the time I noticed the Feizi *et al.* paper in the journal Nature Biotechnology early in August 2013, my experiences reading Kellis’ papers had subtly altered the dynamic between myself and the printed word. Usually, when I read a paper and I don’t understand something, I assume the fault lies with me. I think most people are like this. But now, when the Feizi *et al.* paper started to not make sense, I didn’t presume the problem was with me. I tried hard to give the paper a fair reading, but after a few paragraphs the spell of the authors was already broken. And so it is that Nicolas Bray and I came to figure out what was really going on in Feizi *et al.*, a project that eventually led us to also look at Barzel-Barabási.

Speaking frankly, it was difficult work to write the blog posts about these articles. In addition to the time it took, it was exhausting and exasperating to discover the flaws, fallacies and frauds. Both Nick and I prefer to do research. But we felt a responsibility to spell out in detail what had happened here. Manolis Kellis is not just any scientist. He has, and continues to play leading roles in major consortium projects such as mod-ENCODE and ENCODE, and he has served on numerous advisory committees for the NHGRI. He is a member of the GCAT (Genomics, Computational Biology and Technology) study section until 2018. That any person would swap out a key figure in a published paper without publishing a correction, and without informing the editor is astonishing. That a person with great responsibility towards scientists is an abuser of science is unacceptable.

Manolis Kellis’ behavior is part of a systemic problem in computational biology. The cross-fertilization of ideas between mathematics, statistics, computer science and biology is both an opportunity and a danger. It is not hard to peddle incoherent math to biologists, many of whom are literally math phobic. For example, a number of responses I’ve received to the Feizi *et al.* blog post have started with comments such as

“I don’t have the expertise to judge the math, …”

Similarly, it isn’t hard to fool mathematicians into believing biological fables. Many mathematicians throughout the country were recently convinced by Jonathan Rothberg to donate samples of their DNA so that they might find out “what makes them a genius”. Such mathematicians, and their colleagues in computer science and statistics, take at face value statements such as “we have figured out what makes a human human”. In the midst of such confusion, it is easy for an enterprising “computational person” to take advantage of the situation, and Kellis has.

I believe the solution for this problem is for computational biologists to start taking themselves more seriously. Whether serving as reviewers for journals, as panel members for funding agencies, on hiring/tenure committees, or writing articles, all of us have to tone down the hype and pay closer attention to the science. There are many examples of what this means: a review of a math/stats containing paper cannot be a single paragraph long and based on a hunch, and similarly computational biologists shouldn’t claim, as have many of the authors of papers I’ve reviewed in these posts, pathways to cure disease and explanations for what makes humans human. Don’t fool the biologists. Don’t fool the computer scientists, statisticians, and mathematicians.

The possibilities for computational methods in biology *are *unlimited. The future is exciting, and there are possibilities for significant advances in areas ranging from molecular and evolutionary biology to medicine. But money, citations and fame cannot rule the day. The details of the #methodsmatter.

Today I came across an arXiv posting from yesterday:

Theoretical bounds on mate-pair information for accurate genome assembly by Henry Lin and Yufeng Shen (arxiv.org/abs/1310.1653, October 7 2013).

It purports to provide a sufficient condition for genome assembly, when in fact it is a rediscovery of Ukkonen’s condition for assembly (and a poor rediscovery at that). I’ve seen this happen before so I thought I’d make a short post about this.

Ukkonen’s paper is

E. Ukkonen, Approximate string matching with q-grams and maximal matches, Theoretical Computer Science 92 (1992), no. 1, 191–211.

The connection to genome assembly is explained well in a recent paper by Guy Bresler, Ma’ayan Bresler and David Tse: Optimal assembly for high-throughput shotgun sequencing (arxiv.org/abs/1301.0068, 18 February 2013).

The following figure from the Bresler *et al.* paper illustrates Ukkonen’s condition:

The condition is that the read length *L* must be at least one greater than the maximum length of any interleaved repeat or triple repeat. In the figure above (Figure 4 in the Bresler *et al.* paper), it is impossible to distinguish the two different assemblies (that merely interchange the sequence between the interleaved repeats) with reads of length *L. *There is a similar example of switching for triple repeats.

The lower bound of Lin and Shen essentially recapitulates the above condition. Moreover, the upper bound is trivial: it merely describes a sufficient library design that if sequenced completely would allow for assembly. Bresler *et al. *go much farther than this, and describe an approach to optimal assembly under a random model of (shotgun) sequencing that in particular, requires shotgun generalizations of Ukkonen’s combinatorial condition. In a future post I will describe in much more detail the Bresler *et al.* results. For now, it suffices to say, Ukkonen should be essential (starting) reading for anyone working on assembly.

**Update** **(October 8**): the authors have graciously contacted me and acknowledged their oversight in not referring to Ukkonen’s work. They will update their manuscript and repost to the arXiv. They also pointed out that their paper discusses specifically paired-end reads, and that this requires a slightly different analysis than Ukkonen’s initial work (a good point that I should have mentioned in the original post). I applaud them for (a) posting the initial result on the arXiv thereby allowing the community to look at their work prior to publication and (b) for responding immediately to improve it given the feedback.

I recently came across a wonderful website of Karl Broman via his homepage. He hosts a list of top ten worst graphs in which he describes examples of badly constructed/displayed graphs from statistical and computational biology papers. He calls out mistakes of others, not in order to humiliate the authors, but rather to educate via example. Its a list every computational biologist should peruse and learn from. The page also has an excellent list of recommended references, including Tufte’s “The visual display of quantitative information” which inspired the Minard plots for transcript abundance shown in Figure 2b and Appendix B of the supplement of my Cufflinks paper.

Interestingly, in addition to the list of graphs, Karl includes a list of tables but with only a single entry (the problem with the table is discussed here). Why should a list of worst graphs have 10 items but a list of tables only one? This is a blatant example of protectionism of tables at the expense of the oft abused graphs. With this post I’d like to remedy this unfair situation and begin the process of assembling a list of top ten worst tables: the tabular hall of shame.

I begin by offering Supplementary Table S6 from the paper “Evidence for Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions” by Luke Ward and Manolis Kellis, Science **337** (2012) 1675–1678:

The table is in support of a main result of the paper, namely that in addition to the 5% of the human genome conserved across mammalian genomes, at least another 4% is subject to lineage-specific constraint. This result is based on adding up the estimates of constrained nucleotides from Table S6 (using column 10, the derived allele frequency measure). These estimates were calculated using a statistic that is computed as follows: for each one of ten bins determined according to estimated background selection strength, and for every ENCODE feature *F*, the average derived allele frequency value *DF* was rescaled to

,

where *DCNDC *and *DNCNE *are the bin-specific average derived allele frequencies of conserved non-degenerate coding regions and non-conserved non-ENCODE regions respectively.

One problem with the statistic is that the non-conserved regions contain nucleotides not conserved in all mammals, which is not the same as nucleotides not conserved in any mammals. The latter would be needed in order to identify human specific constraint. Second, the statistic *PUCF* is used as a proxy for the proportion under constraint even though, as defined, it could be less than zero or greater than one. Indeed, in Table S6 there are four values among the confidence intervals for the estimated proportions using derived allele frequency that include values less than 0% or above 100%, for example:

Ward and Kellis are therefore proposing that some features might have a negative number of nucleotides under constraint. Moreover, while it is possible that PUCF might correlate with the true proportion of nucleotides under constraint, there is no argument provided in the paper. Thus, while Ward and Kellis claim to have estimated the proportion of nucleotides under constraint, they have only computed a statistic named “proportion under constraint”.

The bottom line is that a table containing percentages should not have negative entries, and if it does reader beware!

On his worst graphs website Karl provides one example of a what he calls “a really bad table” and here I have offered a second (amazingly Table S5 from the Ward-Kellis paper is also a candidate for the list– Nicolas Bray and I review its sophistry in an arXiv note– but I think every paper should be represented only once on the list). I ask you, the reader, to help me round out the list by submitting more examples in the comments or by email directly to me. Tables from my papers are fair game (notably example #10 of a bad graph on Karl’s list is from one of his own papers). Please help!

I recently completed a term as director of our Center for Computational Biology and one of the things I am proud of having accomplished is helping Brian McClendon, program administrator for the Center, launch a new Ph.D. program in computational biology at UC Berkeley.

The Ph.D. program includes a requirement for taking a new course, “Classics in Computational Biology” (CMPBIO 201), that introduces students to research projects and approaches in computational biology via a survey of classic papers. I taught the class last week and the classic paper I chose was “Comparison of biosequences” by Temple Smith and Michael Waterman, Advances in Applied Mathematics **2 **(1981), p 482–489. I gave a preparatory lecture this past week in which I discussed the Needleman-Wunsch algorithm, followed by leading a class discussion about the Smith-Waterman paper and related topics. This post is an approximate transcription of my lecture on the Needleman-Wunsch algorithm:

The Needleman-Wunsch algorithm was published in “A general method applicable to the search for similarities in the amino acid sequence of two proteins“, Needleman and Wunsch, Journal of Molecular Biology **48** (1970), p 443–453. As with neighbor-joining or UniFrac that I just discussed in my previous blog post, the Needleman-Wunsch algorithm was published without an explanation of its meaning, or even a precise description of the dynamic programming procedure it is based on. In their paper 11 years later, Smith and Waterman write

“In 1970 Needleman and Wunsch [1] introduced their homology (similarity) algorithm. From a mathematical viewpoint, their work lacks rigor and clarity. But their algorithm has become widely used by the biological community for sequence comparisons.”

They go on to state that “The statement in Needleman-Wunsch [1] is less general, unclearly stated, and does not have a proof” (in comparison to their own main Theorem 1). The Theorem Smith and Waterman are referring to explicitly describes the dynamic programming recursion only implicit in Needleman-Wunsch, and then explains why it produces the “optimal” alignment. In hindsight, the Needleman-Wunsch algorithm can be described formally as follows:

Given two sequences *a *and *b *of lengths and respectively, an *alignment* * *is a sequence of pairs:

.

Given an alignment , we let

and

.

In other words, *M* counts the number of matching characters in the alignment, *X* counts the number of mismatches and *S* the number of “spaces”, i.e. characters not aligned to another character in the other sequence. Since every character must be either matched, mismatched, or a not aligned, we have that

.

Given scores (parameters) consisting of real numbers , the Needleman-Wunsch algorithm finds an alignment that maximizes the function . Note that by the linear relationship between *M,X* and *S *described above, there are really only *two free parameters,* and without loss of generality we can assume they are *s *and *x*.* *Furthermore, only *S *and *X *are relevant for computing the score of the alignment; when the scores are provided statistical meaning, one can say that *S *and* X* are sufficient statistics. We call them the *summary* of the alignment.

The specific procedure of the Needleman-Wunsch algorithm is to compute a matrix *S* recursively:

where is the indicator function that is 1 if the characters are the same, and zero otherwise, and an initialization step consists of setting and for all *i,j*. There are numerous generalizations and improvements to the basic algorithm, both in extending the scoring function to include more parameters (although most generalizations retain the linearity assumption), and algorithms for trading off time for space improvements (e.g. divide-and-conquer can be used to improve the space requirements from to ).

An important advance in understanding the Needleman-Wunsch algorithm was the realization that the “scores” *m,x* and *s* can be interpreted as logarithms of probabilities if the sign on the parameters is the same, or as log-odds ratios if the signs are mixed. This is discussed in detail in the book “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Sequences” by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison.

One insight from the statistical point of view is that if *max* and *plus* in the recursion are replaced with *plus *and *times*, and the logarithms of probabilities are replaced by the probabilities themselves, then the Needleman-Wunsch recursion computes the total probability of the pairs of sequences marginalized over alignments (i.e., it is the forward algorithm for a suitably defined hidden Markov model). Together with the backward algorithm, which consists of running the forward algorithm on the reversed sequences, this provides an efficient way to compute for every pair of nucleotides *the posterior probability that they are aligned. *In other words, there is a meaningful interpretation and useful application for the Needleman-Wunsch algorithm with the semi-ring replaced with .

In work in collaboration with Bernd Sturmfels, we realized that it also makes sense to run the Needleman-Wunsch algorithm with the polytope algebra . Here is the set of convex polytopes in for some *d, *polytopes are “added” using the geometric operation of the convex hull of the union of polytopes, and they are “multiplied” by taking Minkowski sum. The result allows one to find not only a single optimal alignment, but *all* optimal alignments. The details are discussed in a pair of papers:

L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16138–16143.

C. Dewey, P. Huggins, K, Woods, B. Sturmfels and L. Pachter, Parametric alignment of Drosophila genomes, PLoS Computational Biology, Volume 2, Number 6 (2006) p e73.

The first explains the mathematical foundations for using the result and is where we coined the term *polytope propagation algorithm* for the Needleman-Wunsch algorithm with the polytope algebra. The second paper illustrates an application to a whole-genome alignment of two Drosophilids and provides an alternative to polytope propagation (the beneath-beyond algorithm) that is much faster. The type of analysis performed in the paper is also called *parametric alignment*, a problem with a long history whose popularity was launched by Dan Gusfield who provided the first software program for parametric alignment called XPARAL.

The polytope propagation algorithm (Needleman-Wunsch with the polytope algebra) is illustrated in the figure below:

The convex polygon in the bottom right panel contains in its interior points corresponding to all possible summaries for the alignments of the two sequences. The optimal summaries are vertices of the polygon. For each vertex, the parameters for which the Needleman-Wunsch algorithm will produce a summary corresponding to that vertex form a cone. The cones together form a fan which is shown inside the polygon.

The running time of polytope propagation depends on the number of vertices in the final polytope. In our papers we provide bounds showing that polytope propagation is extremely efficient. In particular, for *d *parameters, the number of vertices is at most . Cynthia Vinzant, formerly in our math department at UC Berkeley, has written a nice paper on lower bounds.

The fact that the Needleman-Wunsch algorithm can be run with three semi-rings means that it is particularly well suited to C++ thanks to templates that allow for the variables in the recursion to be abstracted. This idea (and code) is thanks to my former student Colin Dewey:

`template<typename SemiRing>`

```
void
alignGlobalLastRow(const string& seq1,
const string& seq2,
const typename SemiRing::Element match,
const typename SemiRing::Element mismatch,
const typename SemiRing::Element gap,
vector& row);
const Element one = SemiRing::multiplicativeIdentity;
// Initialize row
row.resize(seq2.size() + 1);
row[0] = one;
// Calculate first row
for (size_t j = 1; j <= seq2.size(); ++j)
row[j] = gap * row[j - 1];
// Calculate remaining rows
Element up, diag;
for (size_t i = 1; i <= seq1.size(); ++i) {
diag = row[0];
row[0] *= gap;
for (size_t j = 1; j <= seq2.size(); ++j) {
up = row[j];
if (seq1[i - 1] == seq2[j - 1]) {
row[j] = match * diag + gap * (up + row[j - 1]);
} else {
row[j] = mismatch * diag + gap * (up + row[j - 1]);
}
diag = up;
}
}
```

To recap, the three semi-rings with which it makes sense to run this algorithm are:

The interpretation of the output of the algorithm with the three semi-rings is:

- is the score (probability) of a maximum alignment between the
*i*th prefix of one sequence and the*j*th prefix of the other. - is the total score (probability) of all alignments between the
*i*th prefix of one sequence and the*j*th prefix of the other. - is a polytope whose faces correspond to summaries that are optimal for some set of parameters.

The semi-rings in (2) and (3) lead to particularly interesting and useful applications of the Needleman-Wunsch algorithm. An example of the use of (2), is in *posterior* *decoding* based alignment where nucleotides are aligned as to maximize the sum of the posterior probabilities computed by (2). In the paper “Alignment Metric Accuracy” (with Ariel Schwartz and Gene Myers) we provide an interpretation in terms of an interesting metric on sequence alignments and in

A.S. Schwartz and L. Pachter, Multiple alignment by sequence annealing, Bioinformatics 23 (2007), e24–e29

we show how to adapt posterior decoding into a (greedy) multiple alignment algorithm. The video below shows how it works, with each step consisting of aligning a single pair of nucleotides from some pair of sequences using posterior probabilities computed using the Needleman-Wunsch algorithm with semi-ring (2):

We later extended this idea in work together with my former student Robert Bradley in what became FSA published in the paper “Fast Statistical Alignment“, R. Bradley et al., PLoS Computational Biology **5** (2009), e1000392.

An interesting example of the application of Needleman-Wunsch with semi-ring (3), i.e. polytope propagation, is provided in the recent paper “The RNA Newton polytope and learnability of energy parameters” by Elmirasadat Forouzmand and Hamidreza Chitsaz, Bioinformatics **29** (2013), i300–i307. Using our polytope propagation algorithm, they investigate whether simple models for RNA folding can produce, for some set of parameters, the folds of sequences with solved structures (the answer is sometimes, but not always).

It is remarkable that forty three years after the publication of the Needleman-Wunsch paper, and thirty two years after the publication of the Smith-Waterman paper, the algorithms remain pertinent, widely used, and continue to reveal new secrets. True classics.

Where are the authors now?

Saul B. Needleman writes about numismatics.

Christian Wunsch is a clinical pathologist in the University of Miami health system.

Temple Smith is Professor Emeritus of Biomedical Engineering at Boston University.

Michael Waterman is Professor of Biological Sciences, Mathematics and Computer Science at the University of Southern California.

Don’t believe the anti-hype. They are saying that RNA-Seq *promises* the discovery of new expression events, but it doesn’t *deliver*:

Is this true? There have been a few papers comparing microarrays to RNA-Seq technology (including one of my own) that I’ll discuss below, but first a break-down of the Affymetrix “evidence”. The first is this figure (the poor quality of the images is exactly what Affymetrix provides, and not due to a reduction in quality on this site; they are slightly enlarged when clicked on):

The content of this figure is an illustration of the gene LMNB1 (Lamin protein of type B), used to argue that microarrays can provide *transcript level resolution* whereas RNA-Seq *can’t*!! Really? Affymetrix is saying that RNA-Seq users would likely use the RefSeq annotation which only has three isoforms. But this is a ridiculous claim. It is well known that RefSeq is a conservative annotation and certainly RNA-Seq users have the same access to the multiple databases Affymetrix used to build their annotation (presumably, e.g. Ensembl). It therefore seems that what Affymetrix is saying with this figure is that RNA-Seq users are dumb.

The next figure is showing the variability in abundance estimates as a function of expression level for RNA-SEq and the HTA 2.0, with the intended message being that microarrays are less noisy:

But there is a subtle trick going on here. And its in the units. The x-axis is showing RPM, which is an abbreviation for Reads Per Million. This is not a commonly used unit, and there is a reason. First, its helpful to review what *is* used. In his landmark paper on RNA-Seq, Ali Mortazavi introduced the units RPKM (note the extra K) that stands for reads per kilobase of transcript per million mapped. Why the extra kilobase term? In my review on RNA-Seq quantification I explain that RPKM is proportional to a maximum likelihood estimate of transcript abundance (obtained from a simple RNA-Seq model). The complete derivation is on page 6 ending in Equation 13; I include a summary here:

The maximum likelihood (ML) abundances are given by

In these equations is the length of transcript *t *(if reads are long it is necessary to modify the length due to edge effects, hence the tilde in the paper), the are the maximum likelihood estimates for the probabilities of selecting reads from transcripts (unnormalized by their length) and finally is the number of reads mapping to transcript *t* while *N* is the total number of mapped reads. The point is that RPKM (the rightmost formula for abundance) is really a unit for describing the maximum likelihood relative abundances () scaled by some factors.

RPKM as a unit has two problems. The first is that in current RNA-Seq experiments reads are paired so that the actual units being counted (in ) are *fragments*. For this reason we modified RPKM to FPKM in the Cufflinks paper (the “F” replaced “R” for fragment instead of read). A more serious problem, noted by Bo Li and Colin Dewey in their paper on RSEM, is that while FPKM is proportional to ML estimates of abundance, the proportionality constants may vary between experiments. For this reason they proposed TPM (transcripts per million) which is also proportional to the ML abundance estimates but with a proportionality constant (a million) that is the same between experiments. All of these units are used simply to avoid writing down the which are in many cases tiny numbers since they must all sum to 1.

Returning to the Affymetrix figure, we see the strange RPM units. In essence, this is the rightmost term in the equation above, with the length terms removed from the denominators. Therefore RPM is proportional to the . If a transcript is short, even if it is equally abundant to a longer transcript ,it will produce less RNA-Seq reads and therefore its will be (possibly considerably) smaller. The effect of displaying RPM for RNA-Seq vs. expression level for the HTA 2.0 arrays is therefore to mix apples and oranges. Since what is being displayed is a coefficient of variation, there is a bias caused by the relationship between length and expression (e.g. many highly expressed housekeeping genes are short).

To be fair to Affymetrix the conversion between the and the can be confusing (its explained in Lemma 14 in the Supplement of the Cufflinks paper). So maybe the discordant x-axes were unintentional…but then there is the third figure:

Here its a bit hard to tell what is going on because not all the information needed to decipher the figure is provided. For example, its not clear how the “expression of exons” was computed or measured for the RNA-Seq experiment. I suspect that as with the previous figure, read numbers were not normalized by length of exons, and moreover spliced reads (and other possibly informative reads from transcripts) were ignored. In other words, I don’t really believe the result.

Having said this, it *is* true that expression arrays can have an advantage in measuring exon expression, because an array measurement is absolute (as opposed to the relative quantification that is all that is possible with RNA-Seq). Array signal is based on hybridization, and it is probably a reasonable assumption that some minimum amount of RNA triggers a signal, and that this amount is independent of the remainder of the RNA in an experiment. So arrays can (and in many cases probably do) have advantages over RNA-Seq.

There are a few papers that have looked into this, for example the paper “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae ” by Nookaew et al., Nucleic Acids Research 40 (2012) who find high reproducibility in RNA-Seq and consistency between arrays and RNA-Seq. Xu et al., in “Human transcriptome array for high-throughput clinical studies“, PNAS 108 (2011), 3707–3712 are more critical, agreeing with Affymetrix that arrays are more sensitive at the exon level. For disease studies, they recommend using RNA-Seq to identify transcripts relevant to the disease, and then screening for those transcripts on patients using arrays.

For the Cuffdiff2 paper describing our new statistical procedures for differential analysis of transcripts and genes, the Rinn lab performed deep RNA-Seq and array expression measurement on the same samples from a HOXA1 knowdown (the experiments included multiple replicates of both the RNA-Seq and the arrays). To my knowledge, it is the deepest and most comprehensive data currently available for comparing arrays and RNA-Seq. Admittedly, the arrays used were not Affymetrix but Agilent SurePrint G3, and the RNA-Seq coverage was deep, however we had two main findings very different from the Affymetrix claims. First, we found overall strong correlation between array expression values and RNA-Seq abundance estimates. The correlation remained strong for large regimes of expression even with very few reads (tested by sequencing fewer reads from a MiSeq). Second, we found that arrays were missing differentially expressed transcripts, especially at low abundance levels. In other words, we found RNA-Seq to have higher resolution. The following figure from our paper made the case (note the overall Spearman Correlation was 0.86):

There are definitely continued applications for arrays. Both in high-throughput screening applications (as suggested in the Xu et al. paper), and also in the development of novel assays. For example Mercer et al. “Targeted rNA sequencing reveals the deep complexity of the human transcriptome“, Nature Biotechnology 30 (2012) 99–104 show how to couple capture (with arrays) with RNA-Seq to provide ultra deep sequencing in subsets of the transcriptome. So its not yet the time to write off arrays. But RNA-Seq has many applications of its own. For example the ability to better detect allele-specific expression, the opportunity to identify RNA-DNA differences (and thereby study RNA editing), and the ability to study expression in non-model organisms where genomes sequences are incomplete and annotations poor. For all these reasons I’m betting on RNA-Seq.

There is a new arXiv paper out with the title **Sailfish: Alignment-free Isoform Quantification from RNA-Seq Reads using Lightweight Algorithms **by Rob Patro, Stephen M. Mount and Carl Kingsford. It describes a new approach to RNA-Seq quantification that is based on directly estimating abundances from *k-*mers rather than read alignments. This is an interesting approach, because it avoids the time-intensive read alignment step that is rapidly becoming a bottleneck in RNA-Seq analysis. The idea of avoiding read alignments to reference genomes/transcriptome in *Seq experiments is being explored in other contexts as well, such as for mutant mapping (by the Korbinian Schneeberger group) and genotyping (by the Gil McVean group). I am particularly interested in these ideas as we have been exploring such methods for association mapping.

Patro, Mount and Kingsford work with a simplified model for RNA-Seq to first obtain approximate transcript abundance estimates. In the notation of my survey paper on RNA-Seq models (see equation 14, except with *k* replaced by *j *to avoid confusion), they are maximizing the likelihood

where the product is over *k-*mers instead of reads, so that (where *k *is the *k*-mer size) rather than the total number of reads. The EM updates are therefore the same as those of other RNA-Seq quantification algorithms (see Figure 4 in my survey). They also implement an acceleration of the EM called SQUAREM (by Varadhan and Roland) in order to improve convergence.

The results of the paper are impressive. They compare speed and accuracy with RSEM, Cufflinks and eXpress and obtain comparable accuracy while avoiding the time intensive alignment of reads to transcripts (or the genome in the case of Cufflinks). An interesting point made is that bias can be corrected after fragment assignment (or in the case of Sailfish after *k*-mer assignment) without much loss in accuracy. We used a similar approximation in eXpress, namely stopping estimation of bias parameters after 5 million reads have been processed, but it seems that postponing the entire correction until fragment assignment is complete is acceptable.

Sailfish also appears to have been well engineered. The code (in C++) is well documented and available in both source and executable (for Linux and Mac OS X). I haven’t had a chance to test it yet but hope to do so soon. ~~My only concern with the manuscript is that the simulation results for eXpress appear to be unreasonable. The experiments conducted on “real data” (for which there appear to be qPCR) suggest that the accuracy of Sailfish is similar to that of eXpress, RSEM and Cufflinks (with RSEM underperforming slightly presumably to the lack of bias correction), but the simulations, performed with the Flux Simulator, are inconsistent. I suspect there is a (trivial) problem with the simulated data and presumably the authors will address this before journal publication.~~ Update: The authors responded to my blog post within a day and we quickly realized the problem was likely to have been that Flux Simulator did not output reads in random order. Random ordering of reads is important for eXpress to function correctly, and when we wrote our paper we verified that Illumina sequencers do indeed output reads in random order. Rob Patro shuffled the Flux Simulator reads and verified that the performance of eXpress on simulated data is consistent with the results on real data (see attached figure). We’re grateful for his quick work in sorting out the issue and thank the authors of Sailfish for posting their paper on the arXiv (as others are starting to do), thereby enabling this exchange to occur prior to publication.

## Recent Comments