Hui Jiang and Julia Salzman have posted a new paper on the arXiv proposing a novel approach to correcting for non-uniform coverage of transcripts in RNA-Seq: “A penalized likelihood approach for robust estimation of isoform expression” (October 1, 2013).

Their paper addresses the issue of non-uniformity of read coverage across transcripts in RNA-Seq, an issue that is frustrating for the challenges it presents in analysis. The non-uniformity of read coverage in RNA-Seq was first noticed in A. Mortazavi et al., Mapping and quantifying mammalian transcriptomes, Nature Methods 5 (2008), 621–628. Figure 1 in the paper (see below) shows an example of non-uniform coverage, and the paper discusses ideas for library preparation that can reduce bias and improve uniformity.

Mortazavi_Fig1b

Figure 1b from Mortazavi et al. (2008) showing (non-uniform) coverage of Myf6.

Supp_1a_MortazaviSupplementary Figure 1a from Mortazavi et al. (2008) describing uniformity of coverage achievable with different library preparations. “Deviation from uniformity” was assessed using the Kolmogorov-Smirnov test.

The experimental approach of modifying library preparation to reduce non-uniformity has been complemented by statistical approaches to the problem. Specifically, various models have been proposed for “correcting” for experimental artefacts that induce non-uniform coverage. To understand Jiang and Salzman’s latest paper, it is helpful to review previous approaches that have been proposed.

First, it is important to realize that although the emphasis on non-uniformity has been on read coverage, RNA-Seq fragments have different lengths and there can also be deviations from the (currently accepted naïve) model in which fragment lengths are chosen at random from a fixed distribution. I think a better picture than the frequently displayed coverage plots (as in the figure above) is to display fragment start sites and fragment lengths as points in the plane, an idea I proposed with Valerie Hower and Steven Evans in V. Hower et al. Coverage statistics for sequence census methods, BMC Bioinformatics 11 (2010), 430:

tlpic

Figure 6 from Evans et al. (2010). Panel A shows fragments covering a gene where every point represents a single fragment with x-coordinate showing where it starts and y-coordinate its length. Panel B shows a simulation with the same fragment length distribution but with starting positions sampled uniformly at random.

The point of our plots is that under the assumption that coverage is uniform in position and that fragment lengths are chosen independently at random, the collection of points arise from a non-homogeneous spatial Poisson process (Theorem 1 in our paper). This allows for testing for more than just uniformity in coverage and we published a tool for this called ReadSpy, described in V. Hower, R. Starfield, A. Roberts and L. Pachter, Quantifying uniformity of mapped reads, Bioinformatics 28 (2012), 2860–2862.

The first model-based correction for non-uniform read coverage was proposed in Jun Li, Hui Jiang and Wing Wong, Modeling non-uniformity in short-read rates in RNA-Seq data, Genome Biology, 11 (2010), R50.   The proposed model was a  modification of the standard model for RNA-Seq, specifically setting a Poisson rate of \mu_{ij} for the number of counts produced by gene in nucleotide j according to

log(\mu_{ij}) = v_i + \alpha + \sum_{k=1}^K \sum_{h \in \{A,C,G,T\}} \beta_{kh} I(\beta_{ijk} = h).

The non-uniformity is achieved by modeling “sequencing preference” at the position, where the preference is modeled by the \beta variables (represents a nucleotide in a window of length K around the start of the read). The model was motivated by a careful analysis of the Mortazavi et al. data, showing reproducible deviations from uniformity associated to sequence (illustrated in their Figure 1, shown below).

Non-uniformity

Non-uniformity of coverage in RNA-Seq. Example fromLi, Jiang and Wong (2010). The figure shows coverage of ApoE (data from Mortazavi et al.) in three different tissues.

The Li et al. paper was preceded by a few days by the paper Kasper Hansen, Sandrine Dudoit and Steven Brenner, Biases in Illumina transcriptome sequencing caused by random hexamer priming, Nucleic Acids Research 38 (2010), e131. The Hansen et al. postulated a mechanism for “sequence preference” leading to non-uniformity of coverage, namely the random hexamer priming used in the standard protocol. The proposed mechanism remains a plausible candidate for contributing to non-uniformity, although a detailed biochemical explanation of non-uniform coverage in RNA-Seq remains elusive.

Hansen et al. also proposed a correction, albeit a heuristic that (in later papers) was shown to be inferior to the approach of Li et al. and others. Specifically, they suggest re-weighting reads (and thereby adjusting total counts by the weighted averages) according to the formula

w(h) = \frac{\frac{1}{6} \sum_{i=24}^{29}\hat{p}_{hep:i}(h)}{\frac{1}{2}(\hat{p}_{hep:1}(h)+\hat{p}_{hep:2}(h)}

where h is a heptamer, w(h) is its weight, \hat{p}_{hep:i}(h) is the proportion of reads which have heptamer h at position and \hat{p}_{hep:1}(h) is the proportion of reads starting with heptamer h. This intuition motivating the formula is to  use the sequences in the middle of reads as a proxy for the expected number of heptamers and then to weight the reads by the ratio of expected to observed heptamers. In A. Roberts et al., Improving RNA-Seq expression estimates by correcting for fragment bias, Genome Biology 12 (2011), R22, we showed that, as expected, a model based approach is significantly better than the heuristic of Hansen et al. (this observation was confirmed in Bo Li and Colin Dewey‘s paper on RSEM). The details of the Roberts et al. correction are best explained in Adam Roberts‘ thesis that I will blog about shortly.

To understand Jiang and Salzman’s latest innovation, it is helpful to examine a previous paper of theirs

Jiang and Salzman’s innovation is to extend the standard likelihood function for RNA-Seq according to (in their notation):

l(\theta,b;n,A)= \sum_{j=1}^J \{ n_j ln ( \sum_{i=1}^{I} \theta_i a_{ij} e^{b_j} ) - \sum_{i=1}^I \theta_i a_{ij} e^{b_j} \} - \lambda \sum_{j=1}^{J} |b_j|.

In this formulation a key notion is that of “read type”, introduced in previous work of Salzman, Jiang, and Wong. The idea is to collapse reads that have the same probability of being generated from a transcript, and in fact further collapsing is possible into “categories” which we do not define here (this same idea that has been recently utilized in the SailFish project). The “types” of reads are indexed by j. The index ranges over transcripts, and a_{ij} is the rate at which read type is sampled from transcript i. The abundance parameters are the \theta_i and finally the b_j are the new parameters for modeling non-uniformity. The case of uniform-at-random sequencing corresponds to $b_j=1$ and the model then reduces to the standard RNA-Seq model. However if the b_j are allowed to be arbitrary parameters, and the penalty $\lambda$ is set to 0, then it is clear that the model is not identifiable. Thus, the proposal is to set \lambda to be non-zero, thereby effectively choosing read types for which there is bias in sequencing.

The paper provides an algorithm for estimating parameters in this penalized likelihood framework, and is well-written. I particularly liked the elementary Proposition 2.4 (the mean of the absolute value of a set of numbers is always greater than or equal to the mean of the absolute value of the median-centered numbers) which I’d never thought of before, and is a useful step in the procedure. In addition to providing an optimization procedure, Jiang and Salzman propose a two-step procedure in which the penalized optimization is used only to identify biased read categories (non-zero b_js) to be discarded as outliers, so that the standard likelihood-based estimation can follow on the remaining categories. They provide conditions under which this two-step procedure is consistent, however I am somewhat skeptic of the thesis of the two-step approach, namely that it is desirable to discard outliers rather than to fully model all read types as in the Roberts et al. paper. The best approach to use in practice is not resolved in the paper, as the method is only described on a small example, rather than in a transcriptome-scale benchmark.

There is one interesting point made in the paper that I think is important to highlight: Figure 1 in the paper shows an example where non-uniform coverage is likely due to missing annotation:

MED15

Figure 1 from Jiang and Salzman showing RNA-Seq coverage (75bp single end reads) for the MED15 gene. The two transcripts are the RefSeq isoforms. Many more transcripts are annotated in the UCSC genes, see below:

UCSC_Med15

Of course, in RNA-Seq experiments RefSeq is never a good annotation to use– the more complete UCSC genes or ENSEMBL are always preferable. However the point about missing isoforms is well-taken, and is something Harold Pimentel and I have recently written about in a review of differential expression we are completing. Missing transcripts can affect quantification, and the Jiang-Salzman method could, in principle, remove the outlier reads before quantification. An alternative approach is to assemble transcripts directly from the RNA-Seq before quantification, as recommended in my original Trapnell et al. 2010 Cufflinks paper (unfortunately the assembly approach is unlikely to be effective for single-end reads).

It is important to note that the bias that is the focus of the Jiang-Salzman paper, and that I’ve focused on in this post, is not the only type of effect observed in RNA-Seq. Positional bias refers to non-uniformity associated with relative 5′–3′ position in transcripts, and there are many other types of effects reported, including biases associated with GC content and base composition (the paper Wei Zheng, Lisa M Chung and Hongyu Zhao, Bias detection and correction in RNA-Sequencing data, BMC Bioinformatics 12 (2011), 290 is a good starting point for description and modeling of some of these issues). However technological advances have mitigated many of these issues, while the type of sequence-specific bias we have addressed appears to be a continuing problem for RNA-Seq.

In summary, the Jiang-Salzman paper has some interesting ideas and is well-written. Although it doesn’t focus on software implementation and transcriptome-wide benchmarking, I believe there is a great need for more methods papers such as this in RNA-Seq.