You are currently browsing the category archive for the ‘talks’ category.

Bi/BE/CS183 is a computational biology class at Caltech with a mix of undergraduate and graduate students. Matt Thomson and I are co-teaching the class this quarter with help from teaching assistants Eduardo Beltrame, Dongyi (Lambda) Lu and Jialong Jiang. The class has a focus on the computational biology of single-cell RNA-seq analysis, and as such we recently taught an introduction to single-cell RNA-seq technologies. We thought the slides used would be useful to others so we have published them on figshare:

Eduardo Beltrame, Jase Gehring, Valentine Svensson, Dongyi Lu, Jialong Jiang, Matt Thomson and Lior Pachter, Introduction to single-cell RNA-seq technologies, 2019. doi.org/10.6084/m9.figshare.7704659.v1

Thanks to Eduardo Beltrame, Jase Gehring and Valentine Svensson for many extensive and helpful discussions that clarified many of the key concepts. Eduardo Beltrame and Valentine Svensson performed new analysis (slide 28) and Jase Gehring resolved the tangle of “doublet” literature (slides 17–25). The 31 slides were presented in a 1.5 hour lecture. Some accompanying notes that might be helpful to anyone interested in using them are included for reference (numbered according to slide) below:

1. The first (title) slide makes the point that single-cell RNA-seq is sufficiently complicated that a deep understanding of the details of the technologies, and methods for analysis, is required. The #methodsmatter.
2. The second slide presents an overview of attributes associated with what one might imagine would constitute an ideal single-cell RNA-seq experiment. We tried to be careful about terminology and jargon, and therefore widely used terms are italicized and boldfaced.
3. This slide presents Figure 1 from Svensson et al. 2018. This is an excellent perspective that highlights key technological developments that have spurred the growth in popularity of single-cell RNA-seq. At this time (February 2019) the largest single-cell RNA-seq dataset that has been published consists of 690,000 Drop-seq adult mouse brain cells (Saunders, Macosko et al. 2018). Notably, the size and complexity of this dataset rivals that of a large-scale genome project that until recently would be undertaken by hundreds of researchers. The rapid adoption of single-cell RNA-seq is evident in the growth of records in public sequence databases.
4. The Chen et al. 2018 review on single-cell RNA-seq is an exceptionally useful and thorough review that is essential reading in the field. The slide shows Figure 2 which is rich in information and summarizes some of the technical aspects of single-cell RNA-seq technologies. Understanding of the details of individual protocols is essential to evaluating and assessing the strengths and weaknesses of different technologies for specific applications.
5. Current single-cell RNA-seq technologies can be broadly classified into two groups: well-based and droplet-based technologies. The Papalexi and Satija 2017 review provides a useful high-level overview and this slide shows a part of Figure 1 from the review.
6. The details of the SMART-Seq2 protocol are crucial for understanding the technology. SMART is a clever acronym for Switching Mechanism At the 5′ end of the RNA Transcript. It allows the addition of an arbitrary primer sequence at the 5′ end of a cDNA strand, and thus makes full length cDNA PCR possible. It relies on the peculiar properties of the reverse transcriptase from the Moloney murine leukemia virus (MMLV), which, upon reaching the 5’ end of the template, will add a few extra nucleotides (usually Cytosines). The resultant overhang is a binding site for the “template switch oligo”, which contains three riboguanines (rGrGrG). Upon annealing, the reverse transcriptase “switches” templates, and continues transcribing the DNA oligo, thus adding a constant sequence to the 5’ end of the cDNA. After a few cycles of PCR, the full length cDNA generated is too long for Illumina sequencing (where a maximum length of 800bp is desirable). To chop it up into smaller fragments of appropriate size while simultaneously adding the necessary Illumina adapter sequences, one can can use the Illumina tagmentation Nextera™ kits based on Tn5 tagmentation. The SMART template switching idea is also used in the Drop-seq and 10x genomics technologies.
7. While it is difficult to rate technologies exactly on specific metrics, it is possible to identify strengths and weaknesses of distinct approaches. The SMART-Seq2 technology has a major advantage in that it produces reads from across transcripts, thereby providing “full-length” information that can be used to quantify individual isoforms of genes. However this superior isoform resolution requires more sequencing, and as a result makes the method less cost effective. Well-based methods, in general, are not as scalable as droplet methods in terms of numbers of cells assayed. Nevertheless, the tradeoffs are complex. For example robotics technologies can be used to parallelize well-based technologies, thereby increasing throughput.
8. The cost of a single-cell technology is difficult to quantify. Costs depend on number of cells assayed as well as number of reads sequenced, and different technologies have differing needs in terms of reagents and library preparation costs. Ziegenhain et al. 2017 provide an in-depth discussion of how to assess cost in terms of accuracy and power, and the table shown in the slide is reproduced from part of Table 1 in the paper.
9. A major determinant of single-cell RNA-seq cost is sequencing cost. This slide shows sequencing costs at the UC Davis Genome Center and its purpose is to illustrate numerous tradeoffs, relating to throughput, cost per base, and cost per fragment that must be considered when selecting which sequencing machine to use. In addition, sequencing time frequently depends on core facilities or 3rd party providers multiplexing samples on machines, and some sequencing choices are likely to face more delay than others.
10. Turning to droplet technologies based on microfluidics, two key papers are the Drop-seq and inDrops papers which were published in the same issue of a single journal in 2015. The papers went to great lengths to document the respective technologies, and to provide numerous text and video tutorials to facilitate adoption by biologists. Arguably, this emphasis on usability (and not just reproducibility) played a major role in the rapid adoption of single-cell RNA-seq by many labs over the past three years. Two other references on the slide point to pioneering microfluidics work by Rustem Ismagilov, David Weitz and their collaborators that made possible the numerous microfluidic single-cell applications that have since been developed.
11.  This slide displays a figure showing a monodispersed emulsion from the recently posted preprint “Design principles for open source bioinstrumentation: the poseidon syringe pump system as an example” by Booeshaghi et al., 2019. The generation of such emulsions is a prerequisite for successful droplet-based single-cell RNA-seq. In droplet based single-cell RNA-seq, emulsions act as “parallelizing agents”, essentially making use of droplets to parallelize the biochemical reactions needed to capture transcriptomic (or other) information from single-cells.
12. The three objects that are central to droplet based single-cell RNA-seq are beads, cells and droplets. The relevance of emulsions in connection to these objects is that the basis of droplet methods for single-cell RNA-seq is the encapsulation of single cells together with single beads in the emulsion droplest. The beads are “barcode” delivery vectors. Barcodes are DNA sequences that are associated to transcripts, which are accessible after cell lysis in droplets. Therefore, beads must be manufactured in a way that ensures that each bead is coated with the same barcodes, but that the barcodes associated with two distinct beads are different from each other.
13. The inDrops approach to single-cell RNA-seq serves as a useful model for droplet based single-cell RNA-seq methods. The figure in the slide is from a protocol paper by Zilionis et al. 2017 and provides a useful overview of inDrops. In panel (a) one sees a zoom-in of droplets being generated in a microfluidic device, with channels delivering cells and beads highlighted. Panel (b) illustrates how inDrops hydrogel beads are used once inside droplets: barcodes (DNA oligos together with appropriate priming sequences) are released from the hydrogel beads and allow for cell barcoded cDNA synthesis. Finally, panel (c) shows the sequence construct of oligos on the beads.
14. This slide is analogous to slide 6, and shows an overview of the protocols that need to be followed both to make the hydrogel beads used for inDrops, and the inDrops protocol itself.  In a clever dual use of microfluidics, inDrops makes the hydrogel beads in an emulsion. Of note in the inDrops protocol itself is the fact that it is what is termed a “3′ protocol”. This means that the library, in addition to containing barcode and other auxiliary sequence, contains sequence only from 3′ ends of transcripts (seen in grey in the figure). This is the case also with other droplet based single-cell RNA-seq technologies such as Drop-seq or 10X Genomics technology.
15. The significance of 3′ protocols it is difficult to quantify individual isoforms of genes from the data they produce. This is because many transcripts, while differing in internal exon structure, will share a 3′ UTR. Nevertheless, in exploratory work aimed at investigating the information content delivered by 3′ protocols, Ntranos et al. 2019 show that there is a much greater diversity of 3′ UTRs in the genome than is currently annotated, and this can be taken advantage of to (sometimes) measure isoform dynamics with 3′ protocols.
16. To analyze the various performance metrics of a technology such as inDrops it is necessary to understand some of the underlying statistics and combinatorics of beads, cells and drops. Two simple modeling assumptions that can be made is that the number of cells and beads in droplets are each Poisson distributed (albeit with different rate parameters). Specifically, we assume that
$\mathbb{P}(\mbox{droplet has } k \mbox{ cells}) = \frac{e^{-\lambda}\lambda^k}{k!}$ and $\mathbb{P}(\mbox{droplet has } k \mbox{ beads}) = \frac{e^{-\mu}\mu^j}{j!}$. These assumptions are reasonable for the Drop-seq technology. Adjustment of concentrations and flow rates of beads and cells and oil allows for controlling the rate parameters of these distributions and as a result allow for controlling numerous tradeoffs which are discussed next.
17. The cell capture rate of a technology is the fraction of input cells that are assayed in an experiment. Droplets that contain one or more cells but no beads will result in a lost cells whose transcriptome is not measured. The probability that a droplet has no beads is $e^{-\mu}$ and therefore the probability that a droplet has at least one bead is $1-e^{-\mu}$. To raise the capture rate it is therefore desirable to increase the Poisson rate $\mu$ which is equal to the average number of beads in a droplet. However increasing $\mu$ leads to duplication, i.e. cases where a single droplet has more than one bead, thus leading .a single cell transcriptome to appear as two or more cells. The duplication rate is the fraction of assayed cells which were captured with more than one bead. The duplication rate can be calculated as $\frac{\mathbb{P}(\mbox{droplet has 2 or more beads})}{\mathbb{P}(\mbox{droplet has 1 or more beads})}$ (which happens to be equivalent to a calculation of the probability that we are alone in the universe). The tradeoff, shown quantitatively as capture rate vs. duplication rate, is shown in a figure I made for the slide.
19. This slide shows a comparison of bead technologies. In addition to inDrops, 10x Genomics single-cell RNA-seq technology also uses hydrogel beads. There are numerous technical details associated with beads including whether or not barcodes are released (and how).
20. Barcode collisions arise when two cells are separately encapsulated with beads that contain identical barcodes. The slide shows the barcode collision rate formula, which is $1-\left( 1-\frac{1}{M} \right)^{N-1}$. This formula is derived as follows: Let $p=\frac{1}{M}$. The probability that a barcodes is associated with k cells is given by the binomial formula ${N \choose k}p^k(1-p)^{N-k}$. Thus, the probability that a barcode is associated to exactly one cell is $Np(1-p)^{N-1} = \frac{N}{M}\left(1-\frac{1}{M}\right)^{N-1}$. Therefore the expected number of cells with a unique barcode is $N\left(1-\frac{1}{M}\right)^{N-1}$ and the barcode collision rate is $\left(1-\frac{1}{M}\right)^{N-1}$. This is approximately $1-\left( \frac{1}{e} \right)^{\frac{N}{M}}$. The term synthetic doublet is used to refer to the situation when two or more different cells appear to be a single cell due to barcode collision.
21. In the notation of the previous slide, the barcode diversity is $\frac{N}{M}$, which is an important number in that it determines the barcode collision rate. Since barcodes are encoded in sequence, a natural question is what sequence length is needed to ensure a desired level of barcode diversity. This slide provides a lower bound on the sequence length.
22. Technical doublets are multiple cells that are captured in a single droplet, barcoded with the same sequence, and thus the transcripts that are recorded from them appear to originate from a single-cell.  The technical doublet rate can be estimated using a calculation analogous to the one used for the cell duplication rate (slide 17), except that it is a function of the Poisson rate $\lambda$ and not $\mu$. In the single-cell RNA-seq literature the term “doublet” generally refers to technical doublets, although it is useful to distinguish such doublets from synthetic doublets and biological doublets (slide 25).
23. One way to estimate the technical doublet rate is via pooled experiments of cells from different species. The resulting data can be viewed in what has become known as a “Barnyard” plot, a term originating from the Macosko et al. 2015 Drop-seq paper. Despite the authors’ original intention to pool mouse, chicken, cow and pig cells, typical validation of single-cell technology now proceeds with a mixture of human and mouse cells. Doublets are readily identifiable in barnyard plots as points (corresponding to droplets) that display transcript sequence from two different species. The only way for this to happen is via the capture of a doublet of cells (one from each species) in a single droplet. Thus, validation of single-cell RNA-seq rigs via a pooled experiment, and assessment of the resultant barnyard plot, has become standard for ensuring that the data is indeed single cell.
24.  It is important to note that while points on the diagonal of barnyard plots do correspond to technical doublets, pooled experiments, say of human and mouse cells, will also result in human-human and mouse-mouse doublets that are not evident in barnyard plots. To estimate such within species technical doublets from a “barnyard analysis”, Bloom 2018 developed a simple procedure that is described in this slide. The “Bloom correction” is important to perform in order to estimate doublet rate from mixed species experiments.
25. Biological doublets arise when two cells form a discrete unit that does not break apart during disruption to form a suspension. Such doublets are by definition from the same species, and therefore will not be detected in barnyard plots. One way to avoid biological doublets is via single nucleus RNA-seq (see, Habib et al. 2017). Single nuclear RNA-seq has proved to be important for experiments involving cells that are difficult to dissociate, e.g. human brain cells. In fact, the formation of suspensions is a major bottleneck in the adoption of single-cell RNA-seq technology, as techniques vary by tissue and organism, and there is no general strategy. On the other hand, in a recent interesting paper, Halpern et al. 2018 show that biological doublets can sometimes be considered a feature rather than a bug. In any case, doublets are more complicated than initially appears, and we have by now seen that there are three types of doublets: synthetic doublets, technical doublets, and biological doublets .
26. Unique molecular identifiers (UMIs) are part of the oligo constructs for beads in droplet single-cell RNA-seq technologies. They are used to identify reads that originated from the same molecule, so that double counting of such molecules can be avoided. UMIs are generally much shorter than cell barcodes, and estimates of required numbers (and corresponding sequence lengths) proceed analogously to the calculations for cell barcodes.
27. The sensitivity of a single-cell RNA-seq technology (also called transcript capture rate) is the fraction of transcripts captured per cell. Crucially, the sensitivity of a technology is dependent on the amount sequenced, and the plot in this slide (made by Eduardo Beltrame and Valentine Svensson by analysis of data from Svensson et al. 2017) shows that dependency.  The dots in the figure are cells from different datasets that included spike-ins, whose capture is being measured (y-axis). Every technology displays an approximately linear relationship between number of reads sequenced and transcripts captured, however each line is describe by two parameters: a slope and an intercept. In other words, specification of sensitivity for a technology requires reporting two numbers, not one.
28. Having surveyed the different attributes of droplet technologies, this slide summarizes some of the pros and cons of inDrops similarly to slide 7 for SMART-Seq2.
29. While individual aspects of single-cell RNA-seq technologies can be readily understood via careful modeling coupled to straightforward quality control experiments, a comprehensive assessment of whether a technology is “good” or “bad” is meaningless. The figure in this slide, from Zhang et al. 2019, provides a useful overview of some of the important technical attributes.
30. In terms of the practical question “which technology should I choose for my experiment?”, the best that can be done is to offer very general decision workflows. The flowchart shown on this slide, also from Zhang et al. 2019, is not very specific but does provide some simple to understand checkpoints to think about. A major challenge in comparing and contrasting technologies for single-cell RNA-seq is that they are all changing very rapidly as numerous ideas and improvements are now published weekly.
31. The technologies reviewed in the slides are strictly transcriptomics methods (single-cell RNA-seq). During the past few years there has been a proliferation of novel multi-modal technologies that simultaneously measures the transcriptome of single cells along with other modalities. Such technologies are reviewed in Packer and Trapnell 2018 and the slide mentions a few of them.

One of my distinct memories from elementary school is going to “library class” to learn about the Dewey decimal classification and how to use a card catalog to find books. Searching for books efficiently was possible because cards in the catalog were sorted lexicographically.

It didn’t occur to me at the time, but the system required authors of books to be totally ordered. Without an ordering of authors in a book with multiple authors, there would be no way to decide where to place the card for the book in a catalog searchable by author. The practice of ordering authors on publications is evident in the oldest printed texts and has persisted to this day. I have never thought that it could be any other way.

However this past Wednesday I was visiting the University of Washington to deliver a seminar, and among the highlights of the visit was my meeting with the graduate students. I met 12 for lunch and two more came for dinner. Meeting with students is always my favorite part of a visit to a university. They have original and creative ideas, and most importantly, are not bound in their thought by archaic tradition. They frequently don’t know what one is supposed to think and how one is supposed to say it. They just think and speak!

One of the students I met on Wednesday was Vanessa Gray, a student of Doug Fowler, who in a conversation on authorship practices suggested to me the radical and brilliant idea that papers should be published without an ordering of authors.

Many journals now have a section called “Author contributions” where roles of individuals in collaborative projects can be described (many journals now require such descriptions). So why bother ordering the authors for a list underneath the title? As far as indexing and searching goes, Google and other search engines require only a set of authors, and not a specific ordering.

I agree with Vanessa that ending author ordering on publications would greatly improve fairness in the biological sciences, where many current projects involve complex assemblies of teams with complementary skills. “First authorship” is not well-defined when one author performed a large number of difficult experiments, and another developed novel algorithms and wrote complex software for analyzing the experiments. Similarly, “last authorship” fails as a concept when students are co-advised, or one principal investigator provides substantial funding on a project, while another is participating in doing the work. And recently, large consortium projects have completely destroyed any meaning of “author” by having hundreds, or even thousands of authors on projects. Even when there are relatively few authors people rarely credit anyone except the first and last authors, even if others did substantial work. In the recent ENCODE paper published in PNAS with 30 authors, it appears to me from the responses to my previous blog post about the paper that the 5th and 6th authors did a lot (majority?) of the work in putting together figures and results, yet I suspect the “credit” for the paper will go to the first author (the flip side in that case is that the first author is where blame is assigned as well).

There is also a theoretical justification for not ordering authors. Ordering of authors on a publication can be thought of as a ranking produced by “votes” of the participants in the project. Of course in practice not all votes are equal. In what is called dictatorship in social choice theory, PIs frequently make the decisions independently of how specific authors feel they may have contributed. This may work on a paper where there is a single PI (although it may be considered unfair by the graduate students and postdocs), however dictatorship as a system for determining authorship certainly breaks down when multiple PIs collaborate on a project. Arrow’s impossibility theorem explains that in the absence of dictatorship, there is a problem in producing a single ordering satisfying two other seemingly basic and essential fairness criteria. Informally, the theorem states that there is no authorship ordering system based on voting of contributing authors that can satisfy the following three criteria:

• If every author thinks that X should be ordered before Y, then the author list should have X placed before Y.
• For a fixed list of voting preferences regarding the ordering of X vs. Y, the ordering between X and Y in the author list will not depend on the ordering of other pairs such as X and Z, Y and Z, or Z and W.
• There is no “dictator”, i.e. no single author possesses the power to determine the author ordering.

Authors frequently have differing opinions about the impact of their own contribution to a publication, and therefore their preferences (votes) for author ordering are discordant. This means that any system for ordering authors will not satisfy everyone’s preferences, and in the sense of Arrow’s impossibility theorem will be unfair. One way around Arrow’s impossibility theorem is to specify authorship order without regard to authors’ preferences, for example by always ordering authors alphabetically (the Hardy-Littlewood rule). This method, usually the one used in the mathematical sciences, is also fraught with problems. Of course, listing author contributions for what they are is not entirely trivial. For example, different authors may have conflicting views about what it means to have “written the text of the paper”. But using words to describe contributions allow for much more detail about what each author did, and allows for nuanced contributions to be described (e.g., John and Jane were in the room when the initial idea for the project was discussed, but did not contribute anything afterwards).

To summarize, in the modern era of electronic publishing ordering of authors is unnecessary, and if it is unnecessary, then why confront Arrow’s theorem and inevitably produce orderings unfairly? Publications should just explain the author contributions. Time to end ordered authorship.

The card catalog at Yale University’s Sterling Memorial Library (from Wikipedia).

I visited Duke’s mathematics department yesterday to give a talk in the mathematical biology seminar. After an interesting day meeting many mathematicians and (computational) biologists, I had an excellent dinner with Jonathan Mattingly, Sayan MukherjeeMichael Reed and David Schaeffer. During dinner conversation, the topic of probability theory (and how to teach it) came up, and in particular Buffon’s needle problem.

The question was posed by Georges-Louis Leclerc, Comte de Buffon in the 18th century:

Suppose we have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor. What is the probability that the needle will lie across a line between two strips?

If the strips are distance $t$ apart, and $l \leq t$, then it is easy to see that the probability $P$ is given by

$P = \int_{\theta =0}^{\frac{\pi}{2}} \int_{x = 0}^{\frac{l}{2}sin \theta} \frac{4}{t \pi} dx d\theta = \frac{2l}{t \pi}$.

The appearance of $\pi$ in the denominator turns the problem into a Monte Carlo technique for estimating $\pi$: simply simulate random needle tosses and count crossings.

It turns out there is a much more elegant solution to the problem– one that does not require calculus. I learned of it from Gian-Carlo Rota when I was a graduate student at MIT. It appears in his book Introduction to Geometric Probability (with Dan Klain) that I have occasionally used when teaching Math 249. The argument relies on the linearity of expectation, and is as follows:

Let $f(l)$ denote the expected number of crossings when a needle of length $l$ is thrown on the floor. Now consider two needles, one of length $l$ and the other $m$, attached to each other end to end (possibly at some angle). If $X_1$ is a random variable describing the number of crossings of the first needle, and $X_2$ of the second, its certainly the case that $X_1$ and $X_2$ are dependent, but because expectation is linear, it is the case that $E(X_1+X_2) = E(X_1)+E(X_2)$. In other words, the total number of crossings is, in expectation, $f(l)+f(m)$.

Buffon’s needle problem: what is the probability that a needle of length $l \leq t$ crosses a line? (A) A short needle being thrown at random on a floor with parallel lines. (B) Two connected needles. The expected number of crossings is proportional to the sum of their lengths. (C) A circle of diameter always crosses exactly two lines.

It follows that $f$ is a linear function, and since $f(0)=0$, we have that $f(l) = cl$ where $c$ is some constant. Now consider a circle of diameter $t$. Such a circle, when thrown on the floor, always crosses the parallel lines exactly twice. If $C$ is a regular polygon with vertices on the circle, and the total length of the polygon segments is $l$, then the total number of crossings is $f(l)$. Taking the limit as the number of segments in the polygon goes to infinity, we find that $f(t \pi ) = 2$. In other words,

$f(t \pi) = c \cdot t \pi = 2 \Rightarrow c = \frac{2}{t \pi}$,

and the expected number of crossings of a needle of length l is $\frac{2l}{t \pi}$. If $l < t$, the number of crossings is either 0 or 1, so the expected number of crossings is, by definition of expectation, equal to the probability of a single crossing. This solves Buffon’s problem no calculus required!

The linearity of expectation appears elementary at first glance. The proof is simple, and it is one of the first “facts” learned in statistics– I taught it to my math 10 students last week. However the apparent simplicity masks its depth and utility; the above example is cute, and one of my favorites, but linearity of expectation is useful in many settings. For example I recently saw an interesting application in an arXiv preprint by Anand Bhaskar, Andy Clark and Yun Song on “Distortion of genealogical properties when the sample is very large“.

The paper addresses an important question, namely the suitability of the coalescent as an approximation to discrete time random mating models, when sample sizes are large. This is an important question, because population sequencing is starting to involve hundreds of thousands, if not millions of individuals.

The results of Bhaskar, Clark and Song are based on dynamic programming calculations of various genealogical quantities as inferred from the discrete time Wright-Fisher model. An example is the expected frequency spectrum for random samples of individuals from a population. By frequency spectrum, they mean, for each k, the expected number of polymorphic sites with k derived alleles and n-k ancestral alleles under an infinite-sites model of mutation in a sample of n individuals. Without going into details (see their equations (8),(9) and (10)), the point is that they are able to derive dynamic programming recursions because they are computing the expected frequencies, and the linearity of expectation is what allows for the derivation of the dynamic programming recursions.

None of this has anything to do with my seminar, except for the fact that the expectation-maximization algorithm did make a brief appearance, as it frequently does in my lectures these days. I spoke mainly about some of the mathematics problems that arise in comparative transcriptomics, with a view towards a principled approach to comparing transcriptomes between cells, tissues, individuals and species.

The Duke Chapel. While I was inside someone was playing the organ, and as I stared at the ceiling, I could have sworn I was in Europe.

Last Saturday I returned from Cold Spring Harbor Laboratories where I spoke at the Genome Informatics Meeting on Stories from the Supplement. On Monday I delivered the “Prestige Lecture” at a meeting of the Center for Science of Information on New Directions in the Science of Information and I started by talking about Cold Spring Harbor Laboratory (CSHL). That is because the Eugenics Record Office at CSHL is where Claude Shannon, famous father of information theory, wrapped up his Ph.D. in population genetics in 1939.

The fact that Shannon did his Ph.D. in population genetics– his Ph.D. was titled “An Algebra for Theoretical Genetics“– is unknown to most information theorists and population geneticists. It is his masters thesis that is famous (for good reason– it can be said to have started the digital revolution), and his paper in 1948 that founded information theory. But his Ph.D. thesis was impressive in its own right: its contents formed the beginning of my talk to the information theorists, and I summarize the interesting story below.

I learned about the details surrounding Shannon’s foray into biology from a wonderful final project paper written for the class The Structure of Engineering Revolutions in the Fall of 2001: Eugene Chiu, Jocelyn Lin, Brok Mcferron, Noshirwan Petigara, Satwiksai Seshasai, Mathematical Theory of Claude Shannon. In 1939, Shannon’s advisor, Vannevar Bush, sent him to study genetics with Barbara Burks at the Eugenics Record Office at Cold Spring Harbor. That’s right, the Eugenics office was located at Cold Spring Harbor from 1910 until 1939, when it was closed down as a result of Nazi eugenics. Fortunately, Shannon was not very interested in the practical aspects of eugenics, and more focused on the theoretical aspects of genetics.

His work in genetics was a result of direction from Vannevar Bush, who knew about genetics via his presidency of the Carnegie Institution of Washington that ran the Cold Spring Harbor research center. Apparently Bush remarked to a colleague that “It occurred to me that, just as a special algebra had worked well in his hands on the theory of relays, another special algebra might conceivably handle some of the aspects of Mendelian heredity”. The main result of his thesis is his Theorem 12:

The notation $\lambda^{hij}_{klm}$ refers to genotype frequencies in a diploid population. The indices $h,i,j$ refer to alleles at three loci on one haplotype, and $k,l,m$ at the same loci on the other haplotype. The $p$ variables correspond to recombination crossover probabilities. $p_{00}$ is the probability of an even number of crossovers between both the 1st and 2nd loci, and the 2nd and 3rd loci. $p_{01}$ is the probability of an even number of crossovers between the 1st and 2nd loci but an odd number of crossovers between the 2nd and 3rd loci, and so on. Finally, the dot notation in the $\lambda$ represents summation over the index (these days one might use a $+$). The result is a formula for the population genotype frequencies $\mu^{hij}_{ilm}$ after $n$ generations. The derivation involves elementary combinatorics, specifically induction, but it is an interesting result and at the time was not something population geneticists had worked out. What I find impressive about it is that Shannon, apparently on his own, mastered the basic principles of (population) genetics of his time, and performed a calculation that is quite similar to many that are relevant in population genetics today. Bush wrote about Shannon “At the time that I suggested that he try his queer algebra on this subject, he did not even know what the words meant… “.

Why did Shannon not pursue a career in population genetics? The Eugenics Record Office closed shortly after he left and Bush discouraged him from continuing in the field, telling him that “few scientists are ever able to apply creatively a new and unconventional method furnished by some one else – at least of their own generation”. Thus, despite encouragement from a number of statisticians and geneticists that his work was novel and of interest, Shannon returned to electrical engineering. Shortly thereafter, the world got information theory.

Of course today population genetics has data, tons of it, and many interesting problems, including some that I think require insights and ideas from information theory. My Prestige Lecture was aimed at encouraging information theorists to return to their Shannon roots, and redirect their focus towards biology. I have been working with information theorist David Tse (academic grandson of Shannon) for the past year on de novo RNA-Seq assembly (a talk on our joint work with postdoc Sreeram Kannan was presented by Sreeram at the Genome Informatics meeting), and I believe the engagement of information theorists in biology would be of great benefit to both fields; in terms of biology, I see many applications of information theory beyond population genetics. Some back-and-forth has already started. Recently there have been some interesting papers using information theory to study genome signatures and compression, but I believe that there are many other fruitful avenues for collaboration. David and Sreeram were the only information theorists at CSHL last week (I think), but I hope that there will be many more at the 2014 meeting in Cambridge, UK!

The beach at Cold Spring Harbor. I took the photo on November 1st before my Genome Informatics keynote.

I have just returned from the Genome Informatics 2013 meeting at CSHL. Jennifer Harrow, Michael Schatz and James Taylor organized a fantastic event that I thoroughly enjoyed.

The purpose of this post is to provide a short summary of my keynote talk, which was titled “Stories from the Supplement” (the talk can be viewed here and the presentation downloaded here). The idea for talking about what goes on in the supplement of papers, was triggered by a specific event towards the end of the reviewing/editing process for the paper:

A. Roberts and L. Pachter, Streaming algorithms for fragment assignment, Nature Methods 10 (2013), p 71—73.

After a thorough and productive review process, deftly handled by editor Tal Nawy whose work on our behalf greatly improved the quality of the paper, we were sent an email from the journal shortly before publication stating that “If the Online Methods section contains more than 10 equations, move the equation-heavy portions to a separate Supplementary Note“. This requirement made it essentially impossible to properly explain our model and method in the paper. After publishing lengthy supplements for the Cufflinks papers (see below) that I feel were poorly reviewed to the detriment of the paper, I decided it was time to talk about this issue in public.

C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. van Baren, S.L. Salzberg, B.J. Wold and L. Pachter,
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology 28, (2010), p 511–515.
Supplementary Material: 42 pages.

C. Trapnell, D.G. Hendrickson, M. Sauvageau, L. Goff, J.L. Rinn and L. Pachter,
Differential analysis of gene regulation at transcript resolution with RNA-seq, Nature Biotechnology, 31 (2012), p 46–53.
Supplementary Material: 70 pages.

My talk contained three examples selected to make a number of points:

• Methods in the supplement frequently contain ideas that transcend the specifics of the paper. These ideas can be valuable in the long run, but when they are in the supplement it is harder to identify what they are and to appreciate their significance.
• Supplements frequently contain errors (my own included). These errors make it difficult for others to understand the methods and implement them independently.
• In RNA-Seq specifically, there are a number of methodological issues buried in the supplements of various papers that have caused confusion in the field.
• The constant push of methods to supplements is part of a general trend to overemphasize the importance of data while minimizing the relevance of methods.

The examples were as follows:

1. Fragment assignment: The idea of probabilistically assigning ambiguously mapped fragments in RNA-Seq is present in many papers, but at least for me, it was the math worked out in the supplements of those papers (and many conversations with my collaborators, especially Cole Trapnell and Adam Roberts) that made me realize the importance of fragment assignment for *Seq. I went on to explain how Nicolas Bray used these insights to develop a fragment assignment model for joint analysis of RNA-Seq. The result is the ability to magnify the effective coverage of individual samples from multiple samples, as shown in my talk using the GEUVADIS data:
In this plot each point represents the accuracy for the samples when quantified independently (black), or by our method (red/blue). The difference between red and blue has to do with a technical choice in the method that I explained in the talk.
2. I talked about the problem of using raw counts for RNA-Seq analysis. Returning to a theme I have discussed in talks and on my blog previously, I explained that even when the goal is differential analysis, raw counts are flawed because “wrong does not cancel out wrong”. The idea of using raw count quantification knowing it is inaccurate, but arguing that it doesn’t matter because the bias cancels during comparisons (e.g. in differential expression or eQTL analysis) is mathematically equivalent to the following:
Acknowledging that $\frac{1}{2}+\frac{2}{3} \neq \frac{3}{5}$ (obtained by summing numerators and then dividing by the sum of denominators) but arguing that it is ok to say that
$\frac{\frac{1}{2}+\frac{2}{3}}{\frac{3}{4}+\frac{5}{6}} = \frac{\frac{3}{5}}{\frac{8}{10}} = \frac{3}{4}$, which is obviously not correct (the answer is $\frac{14}{19}$).
A key point I made is that even though it might seem that the wrong answer is at least close to the correct answer, in practice, on real data, the differences can be significant. I showed an analysis done by Cole Trapnell using an extensive dataset generated in the Rinn lab for the Cufflinks 2 paper that makes this point.
3. I talked about the different units currently being used for RNA-Seq quantification, such as CPM, RPKM, FPKM and TPM (all of them appeared in various talks during the meeting). I discussed the history of the units, and why they were chosen, and argued in favor of simply using the relative abundance estimates (perhaps normalized by a constant factor, as in TPM). This point of view was first advocated by Bo Li and Colin Dewey in their RSEM paper, and I hope the community will adopt their point of view.

My penultimate slide showed this world map of high-throughput sequencers. I think this is a very cool map, as it shows (by proxy) the extraordinary extent of sequencing going on worldwide. However it is yet another example of a focus on data, and data generation, in genomics. Data is of course, very important, but I showed another map for methodsthat illustrates a very different thing: the extent of computational biology going on around the world. The methods map is made from visits to the Cufflinks website. I mashed it with the sequencer map to make the case that data and methods go hand-in-hand.

Sequencers of the world and users of Cufflinks.

When the organizers of ISMB 2013 kindly invited me to give a keynote presentation this year, I decided to use the opportunity to survey “sequence census” methods. These are functional genomics assays based on high throughput sequencing. It has become customary to append the suffix “-Seq” to such assays (e.g. RNA-Seq), and I therefore prefer the term *Seq where the * denotes a wildcard.

The starting point for preparing the talk was a molecular biology seminar I organized in the Spring of 2010, where we discussed new high-throughput sequencing based assays with a focus on the diverse range of applications being explored. At the time I had performed a brief literature search to find *Seq papers for students to present, and this was helpful as a starting point for building a more complete bibliography for my talk. Finding *Seq assays is not easy- there is no central repository for them- but after some work I put together a (likely incomplete) bibliography that is appended to the end of the post. Update: I’ve created a page for actively maintaining a bibliography of *Seq assays.

The goal for my talk was to distill what appears to be a plethora of complex and seemingly unrelated experiments (see, e.g., Drukier et al. on the *Seq list) into a descriptive framework useful for thinking about their commonalities and differences. I came up with this cartoonish figure that I briefly explain in the remainder of this post. In future posts I plan to review in more detail some of these papers and the research they have enabled.

A typical assay first involves thinking of a (molecular) measurement to be made. The problem of making the measurement is then “reduced” (in the computer science sense of the word) to sequencing. This means that the measurement will be inferred from sequencing bits of DNA from “target” sequences (created during the reduction), and then counting the resulting fragments.  It is important to keep in mind that the small fragments of DNA are sampled randomly from the targets, but the sampling may not be uniform.

The inference step is represented in the “Solve inverse problem” box in the figure, and involves developing a model of the experiment, together with an algorithm for inferring the desired measurement from the data (the sequenced DNA reads). Finally, the measurement becomes a starting point for further (computational) biology inquiry.  Read the rest of this entry »