When the organizers of ISMB 2013 kindly invited me to give a keynote presentation this year, I decided to use the opportunity to survey “sequence census” methods. These are functional genomics assays based on high-throughput sequencing. It has become customary to append the suffix “-Seq” to such assays (e.g., RNA-Seq), and I therefore prefer the term *Seq, where the * denotes a wildcard.
The starting point for preparing the talk was a molecular biology seminar I organized in the Spring of 2010, where we discussed new assays based on high-throughput sequencing, with a focus on the diverse range of applications being explored. At the time I performed a brief literature search to find *Seq papers for students to present, and this was helpful as a starting point for building a more complete bibliography for my talk. Finding *Seq assays is not easy (there is no central repository for them), but after some work I put together a (likely incomplete) bibliography that is appended to the end of the post. Update: I’ve created a page for actively maintaining a bibliography of *Seq assays.
The goal for my talk was to distill what appears to be a plethora of complex and seemingly unrelated experiments (see, e.g., Drukier et al. on the *Seq list) into a descriptive framework useful for thinking about their commonalities and differences. I came up with this cartoonish figure that I briefly explain in the remainder of this post. In future posts I plan to review in more detail some of these papers and the research they have enabled.
A typical assay first involves thinking of a (molecular) measurement to be made. The problem of making the measurement is then “reduced” (in the computer science sense of the word) to sequencing. This means that the measurement will be inferred from sequencing bits of DNA from “target” sequences (created during the reduction), and then counting the resulting fragments. It is important to keep in mind that the small fragments of DNA are sampled randomly from the targets, but the sampling may not be uniform.
The inference step is represented in the “Solve inverse problem” box in the figure, and involves developing a model of the experiment, together with an algorithm for inferring the desired measurement from the data (the sequenced DNA reads). Finally, the measurement becomes a starting point for further (computational) biology inquiry.
The canonical example of a *Seq assay is RNA-Seq. It was first published in two separate papers (Mortazavi et al., Nature Methods, 2008, and Nagalakshmi et al., Science, 2008). The term RNA-Seq now encapsulates numerous variants of the original idea, but at its core it involves the sequencing of cDNA fragments made from RNA for the purpose of quantifying the abundances of transcripts in a transcriptome (and possibly also assembling the transcripts). Thus, the desired measurement in RNA-Seq is RNA abundance; the reduction to sequencing simply involves reverse transcribing RNA to cDNA, so the cDNA constitutes the targets; and the data (samples of cDNA) are the basis for inferring abundances.
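To make the "reduction" and the inverse problem concrete, here is a minimal Python sketch of the RNA-Seq case. It is a toy under stated assumptions, not the model of any particular tool: every read maps unambiguously, sampling within a transcript is uniform, and a transcript is sampled with probability proportional to its abundance times its length. The names `simulate_counts` and `estimate_abundance`, and the transcripts A, B and C, are hypothetical.

```python
import random
from collections import Counter


def simulate_counts(abundance, length, n_reads, seed=0):
    """Forward model (toy assumption): draw reads from targets with
    probability proportional to abundance * length, then count them."""
    rng = random.Random(seed)
    names = list(abundance)
    weights = [abundance[t] * length[t] for t in names]
    return Counter(rng.choices(names, weights=weights, k=n_reads))


def estimate_abundance(counts, length):
    """Inverse problem (toy version): divide counts by length and
    renormalize to recover relative abundances."""
    rate = {t: counts.get(t, 0) / length[t] for t in length}
    total = sum(rate.values())
    return {t: r / total for t, r in rate.items()}


# Hypothetical transcriptome with three transcripts.
truth = {"A": 0.5, "B": 0.3, "C": 0.2}
lengths = {"A": 1000, "B": 2000, "C": 500}

counts = simulate_counts(truth, lengths, n_reads=200_000)
est = estimate_abundance(counts, lengths)
for name in truth:
    print(name, counts[name], round(est[name], 3))
```

The raw counts are dominated by the long transcript B even though A is more abundant, which is why the measurement has to be inferred from the counts rather than read off them directly.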
In my ISMB talk I focused on two subproblems that constitute inference in any *Seq assay:
- Fragment assignment: this is the problem of determining the origin of sequenced DNA fragments. In the case of RNA-Seq, it is the problem of assigning reads to transcripts. Difficulties may arise when reads map ambiguously to different target sequences, or to different locations within a single target.
- Density deconvolution: this is the problem of determining, for a given location in a target sequence (specifically a transcript in the case of RNA-Seq), the number of reads that would have originated from that location if the sampling from targets were uniform.
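As a rough illustration of the density deconvolution problem just defined, the Python sketch below assumes the relative sampling bias at each position is already known, which is the hard part in practice and is not addressed here. The function name `deconvolve_density` and the numbers are hypothetical, and this is not the bias model of any published method.

```python
def deconvolve_density(counts, bias):
    """Rescale observed per-position read counts to what uniform
    sampling would have produced.

    counts[i]: reads observed starting at position i of a target.
    bias[i]:   relative sampling propensity at position i (assumed
               known and positive; only ratios matter).
    """
    if len(counts) != len(bias):
        raise ValueError("counts and bias must have the same length")
    if any(b <= 0 for b in bias):
        raise ValueError("bias values must be positive")
    mean_bias = sum(bias) / len(bias)
    # Dividing by bias relative to its mean removes the positional
    # effect while keeping the overall scale of the counts.
    return [c * mean_bias / b for c, b in zip(counts, bias)]


# A target sampled twice as often at its first position and half as
# often at its last two positions.
observed = [20, 10, 5, 5]
bias = [2.0, 1.0, 0.5, 0.5]
print(deconvolve_density(observed, bias))
```

In this toy case the uneven observed counts are consistent with a flat underlying density once the bias is divided out.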
Fragment assignment and density deconvolution are obviously related problems, and the solution of one is necessary for solving the other. Every *Seq assay I am aware of involves solving both these problems, although one may be easier than the other depending on the specific application. In the case of RNA-Seq, fragment assignment turns out to be crucial for quantifying abundance accurately, and for performing differential expression analysis. The problem was already discussed in the Mortazavi et al. paper, where multi-mapping reads are assigned using the RESCUE algorithm. In retrospect this turned out to be essentially equivalent to one iteration of the EM algorithm, which is now used in numerous RNA-Seq analysis tools. Density deconvolution in RNA-Seq is required in order to remove bias caused by priming and other technological issues. It was first mentioned as a key issue in Hansen et al., Nucleic Acids Research, 2010, and a model for correcting for it was published in Roberts et al., Genome Biology, 2011.
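For readers who have not seen EM applied to fragment assignment, here is a minimal Python sketch. It assumes each read comes with the set of targets it is compatible with, that every read has at least one compatible target, and it ignores transcript lengths and sequencing errors, so the output is the fraction of reads originating from each target rather than transcript abundance. The function name `em_fragment_assignment` and the toy data are hypothetical; this is not the implementation used in any specific tool.

```python
def em_fragment_assignment(compat, n_targets, n_iter=200, tol=1e-10):
    """Estimate the fraction of reads originating from each target.

    compat: one tuple per read, listing the indices of the targets the
            read maps to (assumed non-empty).
    """
    theta = [1.0 / n_targets] * n_targets
    for _ in range(n_iter):
        expected = [0.0] * n_targets
        # E-step: split each read among its compatible targets in
        # proportion to the current estimates.
        for targets in compat:
            total = sum(theta[t] for t in targets)
            for t in targets:
                expected[t] += theta[t] / total
        # M-step: renormalize the expected read counts.
        new_theta = [e / len(compat) for e in expected]
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Toy data: 6 reads unique to target 0, 2 unique to target 1,
# 4 ambiguous between targets 0 and 1, and none for target 2.
reads = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
theta = em_fragment_assignment(reads, n_targets=3)
print([round(x, 4) for x in theta])
```

In this example the ambiguous reads end up split according to the 3:1 ratio of the uniquely mapping reads, so the estimates settle near 0.75, 0.25 and 0 for the three targets.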
In my ISMB talk I provided different examples of how fragment assignment and density deconvolution arise in *Seq assays, and discussed briefly the statistics involved in solving the inference problems. The slides are available here (in Keynote) and here (PDF) and ISMB has made a video of the talk available (free registration required). I am also maintaining a page with a list of all *Seq assays.