There is a new arXiv paper out with the title "Sailfish: Alignment-free Isoform Quantification from RNA-Seq Reads using Lightweight Algorithms" by Rob Patro, Stephen M. Mount and Carl Kingsford. It describes a new approach to RNA-Seq quantification that is based on directly estimating abundances from k-mers rather than from read alignments. This is an interesting approach, because it avoids the time-intensive read alignment step that is rapidly becoming a bottleneck in RNA-Seq analysis. The idea of avoiding read alignments to reference genomes/transcriptomes in *Seq experiments is being explored in other contexts as well, such as for mutant mapping (by the Korbinian Schneeberger group) and genotyping (by the Gil McVean group). I am particularly interested in these ideas as we have been exploring such methods for association mapping.

Patro, Mount and Kingsford work with a simplified model for RNA-Seq to first obtain approximate transcript abundance estimates. In the notation of my survey paper on RNA-Seq models (see equation 14, with the symbol k renamed to avoid confusion with the k-mer size), they are maximizing the likelihood

$L(\rho) = \prod_{i=1}^N \left( \sum_{j=1}^K y_{i,j} \frac{\alpha_j}{l_j} \right)$

where the product is over k-mers instead of reads, so that $N=4^k$ (where $k$ is the k-mer size) rather than the total number of reads. The EM updates are therefore the same as those of other RNA-Seq quantification algorithms (see Figure 4 in my survey). They also implement an acceleration of the EM called SQUAREM (by Varadhan and Roland) in order to improve convergence.
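To make the update concrete, here is a minimal Python sketch of a plain EM iteration for a likelihood of this form. It is an illustration under stated assumptions, not Sailfish's implementation: the function name `em_abundances` and the toy data are hypothetical, each k-mer $i$ is assumed to carry an observed count `counts[i]` (setting every count to 1 recovers the formula exactly as written above), and the SQUAREM acceleration is left out.

```python
def em_abundances(y, counts, lengths, n_iter=200, tol=1e-10):
    """Plain EM for L = prod_i (sum_j y[i][j] * alpha[j] / lengths[j]) ** counts[i].

    y[i][j]   : 1 if k-mer i occurs in transcript j, else 0
    counts[i] : observed count of k-mer i (an assumption of this sketch)
    lengths[j]: length of transcript j
    Returns alpha, a list of non-negative values summing to 1.
    """
    n_kmers, n_tx = len(y), len(lengths)
    alpha = [1.0 / n_tx] * n_tx  # uniform starting point

    for _ in range(n_iter):
        new_alpha = [0.0] * n_tx
        for i in range(n_kmers):
            # E-step: split k-mer i among its compatible transcripts
            weights = [y[i][j] * alpha[j] / lengths[j] for j in range(n_tx)]
            denom = sum(weights)
            if denom == 0.0:
                continue  # k-mer compatible with no transcript: ignore it
            for j in range(n_tx):
                new_alpha[j] += counts[i] * weights[j] / denom

        # M-step: renormalize the expected assignments
        norm = sum(new_alpha)
        if norm == 0.0:
            raise ValueError("no k-mer is compatible with any transcript")
        new_alpha = [a / norm for a in new_alpha]

        delta = max(abs(a - b) for a, b in zip(alpha, new_alpha))
        alpha = new_alpha
        if delta < tol:
            break
    return alpha


# Hypothetical toy data: two transcripts of equal length, one k-mer unique
# to each and one k-mer shared by both.
toy_y = [[1, 0], [0, 1], [1, 1]]
toy_counts = [30, 10, 20]
toy_lengths = [100.0, 100.0]
print([round(a, 3) for a in em_abundances(toy_y, toy_counts, toy_lengths)])
# prints [0.75, 0.25]
```

In the toy example the shared k-mer is uninformative, so the estimate is driven by the 30:10 ratio of the unique k-mers. Because the loop runs over distinct k-mers rather than reads, the cost per iteration does not grow with sequencing depth.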

The results of the paper are impressive. They compare speed and accuracy with RSEM, Cufflinks and eXpress and obtain comparable accuracy while avoiding the time-intensive alignment of reads to transcripts (or the genome in the case of Cufflinks). An interesting point made is that bias can be corrected after fragment assignment (or in the case of Sailfish after k-mer assignment) without much loss in accuracy. We used a similar approximation in eXpress, namely stopping estimation of bias parameters after 5 million reads have been processed, but it seems that postponing the entire correction until fragment assignment is complete is acceptable.

Sailfish also appears to have been well engineered. The code (in C++) is well documented and available in both source and executable form (for Linux and Mac OS X). I haven’t had a chance to test it yet but hope to do so soon. My only concern with the manuscript is that the simulation results for eXpress appear to be unreasonable. The experiments conducted on “real data” (for which there appear to be qPCR measurements) suggest that the accuracy of Sailfish is similar to that of eXpress, RSEM and Cufflinks (with RSEM underperforming slightly, presumably due to the lack of bias correction), but the simulations, performed with the Flux Simulator, are inconsistent with this. I suspect there is a (trivial) problem with the simulated data and presumably the authors will address this before journal publication.

Update: The authors responded to my blog post within a day and we quickly realized the problem was likely to have been that Flux Simulator did not output reads in random order. Random ordering of reads is important for eXpress to function correctly, and when we wrote our paper we verified that Illumina sequencers do indeed output reads in random order. Rob Patro shuffled the Flux Simulator reads and verified that the performance of eXpress on simulated data is consistent with the results on real data (see attached figure). We’re grateful for his quick work in sorting out the issue and thank the authors of Sailfish for posting their paper on the arXiv (as others are starting to do), thereby enabling this exchange to occur prior to publication.
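For readers who want to apply the same fix to their own simulated data, the following is a minimal sketch of shuffling FASTQ records in memory. It is not necessarily how the reads were shuffled in this case; the helper names `shuffled_order` and `reorder_fastq` are hypothetical, and the sketch assumes the common four-lines-per-record FASTQ layout and files small enough to hold in memory. For paired-end data the same permutation is applied to both mate files so that pairs stay in sync.

```python
import random


def shuffled_order(n_records, seed=0):
    """Return a random permutation of record indices (seeded for reproducibility)."""
    order = list(range(n_records))
    random.Random(seed).shuffle(order)
    return order


def reorder_fastq(lines, order):
    """Reorder 4-line FASTQ records according to a permutation of record indices."""
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    if sorted(order) != list(range(len(records))):
        raise ValueError("order must be a permutation of the record indices")
    return [line for idx in order for line in records[idx]]


# Hypothetical paired-end example: three read pairs held as lists of lines.
mate1 = ["@r0/1", "ACGT", "+", "IIII",
         "@r1/1", "CCGT", "+", "IIII",
         "@r2/1", "TTGA", "+", "IIII"]
mate2 = ["@r0/2", "TGCA", "+", "IIII",
         "@r1/2", "ACGG", "+", "IIII",
         "@r2/2", "TCAA", "+", "IIII"]

order = shuffled_order(len(mate1) // 4, seed=42)
shuffled1 = reorder_fastq(mate1, order)
shuffled2 = reorder_fastq(mate2, order)
```

Reusing one `order` for both files is the design choice that matters here: shuffling each mate file independently would break the pairing.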