You are currently browsing the category archive for the ‘education’ category.

My Caltech calculus professor, Tom Apostol, passed away yesterday May 8th 2016. When I arrived in his Math 1b class in the Fall of 1990 I thought, like most of my classmates, that I already knew calculus. I may have known it but I didn’t understand it. Apostol taught me the understanding part.

Apostol’s calculus books, affectionally called “Tommy I” and ‘Tommy II” were not just textbooks for students to memorize but rather mathematical wisdom and beauty condensed into a pair of books intended to transform grade-obsessed freshmen and sophomores into thinking human beings. Most of all, Apostol emphasized the idea that fundamental to mathematics is how one thinks about things, not just what one is thinking about. One of his iconic examples of this was the ice-cream-cone-proof that the focal property of an ellipse is a consequence of its definition as a section of a cone. Specifically, taking as the definition of an ellipse a plane curve obtained by intersecting an inclined plane with a cone


the goal is to both define the two foci, and then to derive the focal point property as illustrated below:


Apostol demonstrated the connection between conic sections and their foci via a proof and picture of Dandelin. His explanation, which I still remember from my freshman year in college, is beautiful (the excerpt below is from his linear algebra book):

Apostol1Apostol2Apostol didn’t invent Dandelin’s spheres but he knew they were “the right way” to think about conic sections, and he figured out “the right way” for each and every one of his explanations. His calculus books are exceptional for their introduction of integration before differentiation, his preference for axiomatic rather than mechanistic definition (e.g. of determinants) and his exercises that are “easy” when the material is understood “in the right way”. His course had a profound influence on my approach not only to mathematics, but to all of my learning in both the sciences and humanities.

One of Apostol’s signature traditions was his celebration of Gauss’ birthday. His classes were always filled with mathematical treats, but on April 30th every year he would hide a cake in the classroom before the students arrived and would serve an edible treat that day instead. Gauss turned 239 just last week. This seems to be a timely moment to take note of that prime number (Apostol was a number theorist) and to eat a slice of cake for Gauss, Apostol, and those who change our lives.



The Common Core State Standards Initiative was intended to establish standards for the curriculum for K–12 students in order to universally elevate the the quality of education in the United States. Whether the initiative has succeeded, or not, is a matter of heated debate. In particular, the merits of the mathematics standards are a contentious matter to the extent that colleagues in my math department at UC Berkeley have penned opposing opinions on the pages of the Wall Street Journal (see Frenkel and Wu vs. Ratner). In this post I won’t opine on the merits of the standards, but rather wish to highlight a shortcoming in the almost universal perspective on education that the common core embodies:

The emphasis on what K–12 students ought to learn about what is known has sidelined an important discussion about what they should learn about what is not known.

To make the point, I’ve compiled a list of unsolved problems in mathematics to match the topics covered in the common core. The problems are all well-known to mathematicians, and my only contribution is to select problems that (a) are of interest to research mathematicians (b) provide a good balance among the different areas of mathematics and (c) are understandable by students studying to (the highlighted) Common Core Standards.  For each grade/high school topic, the header is a link to the Common Core Standards. Based on the standards, I have selected a single problem to associate to to the grade/topic (although it is worth noting there were always a large variety to choose from). For each problem, I begin by highlighting the relevant common core curriculum which the problem is related to, followed by a warm up exercise to help introduce students to the problem. The warm ups are exercises that should be solvable by students with knowledge of the Common Core Standards. I then state the unsolved problem, and finally I provide (in brief!) some math background, context and links for those who are interested in digging deeper into the questions.

Disclaimer: it’s possible, though highly unlikely that any of the questions on the list will yield to “elementary” methods accessible to K–12 students. It is quite likely that many of the problems will remain unsolved in our lifetimes. So why bother introducing students to such problems? The point is that the questions reveal a sliver of the vast scope of mathematics, they provide many teachable moments and context for the mathematics that does constitute the common core, and (at least in my opinion) are fun to explore (for kids and adults alike). Perhaps most importantly, the unsolved problems and conjectures reveal that the mathematics taught in K–12 is right at the edge of our knowledge: we are always walking right along the precipice of mystery. This is true for other subjects taught in K–12 as well, and in my view this reality is one of the important lessons children can and should learn in school.

One good thing about the Common Core Standards, is that their structure allows, in principle, for the incorporation of standards for unsolved problems the students ought to know about. Hopefully education policymakers will consider extending the Common Core Standards to include such content. And hopefully educators will not only teach about what is not known, but will also encourage students to ask new questions that don’t have answers. This is because  “there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”


Relevant common core: “describing shapes and space.”

Warm up: can you color the map of Africa with four colors so that no two countries that touch are filled in with the same color?


Can you color in the map with three colors so that no two countries that touch are filled in with the same color?

The unsolved problem: without trying all possibilities, can you tell when a map can be colored in with 3 colors so that no two countries that touch are filled in with the same color?

Background and context: The four color theorem states (informally) that “given any separation of a plane into contiguous regions, producing a figure called a map, no more than four colors are required to color the regions of the map so that no two adjacent regions have the same color.” (from wikipedia). The mathematics statement is that any planar graph can be colored with four colors. Thus, the first part of the “warm up” has a solution; in fact the world map can be colored with four colors. The four color theorem is deceivingly simple- it can be explored by a kindergartner, but it turns out to have a lengthy proof. In fact, the proof of the theorem requires extensive case checking by computer. Not every map can be colored with three colors (for an example illustrating why see here). It is therefore natural to ask for a characterization of which maps can be 3-colored. Of course any map can be tested for 3-colorability by trying all possibilities, but a “characterization” would involve criteria that could be tested by an algorithm that is polynomial in the number of countries. The 3-colorability of planar graphs is NP-complete.

First grade

Relevant common core: “developing understanding of whole number relationships”.

Warm up: Suppose that in a group of people, any pair of individuals are either strangers or acquaintances. Show that among three people there must be at either at least two pairs of strangers or else at least two pairs of acquaintances.

The unsolved problem: Is it true that among 45 people there must be 5 mutual strangers or 5 mutual acquaintances?

Background and context: This problem is formally known as the question of computing the Ramsey number R(5,5). It is an easier (although probably difficult for first graders) problem to show that R(3,3)=6, that is, that among six people there will be either three mutual strangers or else three mutual acquaintances. It is known that 43 \leq R(5,5) \leq 49. The difficulty of computing Ramsey numbers was summed up by mathematician Paul Erdös as follows:

“Imagine an alien force, vastly more powerful than us, landing on Earth and demanding the value of R(5, 5) or they will destroy our planet. In that case we should marshal all our computers and all our mathematicians and attempt to find the value. But suppose, instead, that they ask for R(6, 6). In that case we should attempt to destroy the aliens.” – from Ten Lectures on the Probabilistic Method.

Second grade

Relevant common core: “building fluency with addition and subtraction”.

Warm up: Pascal’s triangle is a triangular array of numbers where each entry in a row is computed by adding up the pair of numbers above it. For example, the first six rows of Pascal’s triangle are:


Write out the next row of Pascal’s triangle.

The unsolved problem: Is there a number (other than 1) that appears 10 times in the (infinite) Pascal’s triangle?

Background and context: The general problem of determining whether numbers can appear with arbitrary multiplicity in Pascal’s triangle is known as Singmaster’s conjecture. It is named after the mathematician David Singmaster who posed the problem in 1971. It is known that the number 3003 appears eight times, but it is not known whether any other number appears eight times, nor, for that matter, whether any other number appears more than eight times.

Third grade

Relevant common core: “(1) developing understanding of multiplication and division and strategies for multiplication and division within 100”.

Warm up: Practice dividing numbers by 2 and multiplying by 3.

The unsolved problem: Choose a natural number n. If n is even, divide it by 2 to get n\div 2. If n is odd, multiply it by 3 and add 1 to obtain 3\times n+1. Repeat the process. Show that for any initial choice n, the repeating process will eventually reach the number 1.

Background and context: The conjecture is called the Collatz conjecture, after Lothar Collatz who proposed it in 1937. It is deceptively simple, but despite much numeric evidence that it is true, has eluded proof. Mathematician Terence Tao has an interesting blog post explaining why (a) the conjecture is likely to be true and (b) why it is likely out of reach of current techniques known to mathematicians.

Fourth grade

Relevant common core: “Determine whether a given whole number in the range 1-100 is prime or composite”.

Warm up: Write the number 100 as the sum of two prime numbers.

The unsolved problem: Show that every even integer greater than 2 can be expressed as the sum of two primes.

Background and context:  This problem is known as the Goldbach conjecture. It was proposed by the mathematician Christian Goldbach in a letter to the mathematician Leonhard Euler in 1742 and has been unsolved since that time (!) In 2013 mathematician Harald Helfgott settled the “weak Goldbach conjecture“, proving that every odd integer greater than 5 is the sum of three primes.

Fifth grade

Relevant common core: “Graph points on the coordinate plane”.

Warm up: A set of points in the coordinate plane are in general position if no three of them lie on a line. A quadrilateral is convex if it does not intersect itself and the line between any two points on the boundary lies entirely within the quadrilateral. Show that any set of five points in the plane in general position has a subset of four points that form the vertices of a convex quadrilateral.

The unsolved problem: A polygon is convex  if it does not intersect itself and the line between any two points on the boundary lies entirely within the polygon. Find the smallest number N so that any N points in the coordinate plane in general position contain a subset of 7 points that form the vertices of a convex polygon.

Background and context: The warm up exercise was posed by mathematician Esther Klein in 1933. The question led to the unsolved problem, which remains unsolved in the general case, i.e. how many points are required so that no matter how they are placed (in general position) in the plane there is a subset that form the vertices of a convex n-gon. There are periodic improvements in the upper bound (the most recent one posted on September 10th 2015), but the best current upper bound is still far from the conjectured answer.


A set of 8 points in the plane containing no convex pentagon.

Sixth grade

Relevant common core: “Represent three-dimensional figures using nets made up of rectangles and triangles”.

Warm upnet

The dodecahedron is an example of a convex polyhedron. A convex polyhedron is a polyhedron that does not intersect itself and that has the property that any line joining two points on the surface lies entirely within the polyhedron. Cut out the net of the dodecahedron (shown above) and fold it into a dodecahedron.

The unsolved problem: Does every convex polyhedron have at least one net?

Background and context: Albrecht Dürer was an artist and mathematician of the German Renaissance. The unsolved problem above is often attributed to him, and is known as Dürer’s unfolding problem. It was formally posed by the mathematician Geoffrey Shephard in 1975.


One of my favorite math art pieces: Albrecht Dürer’s engraving Melencolia I.

Seventh grade

Relevant common core: “Analyze proportional relationships and use them to solve real-world and mathematical problems.”

Warm up: Two runners, running at two different speeds v_0 and v_1, run along a circular track of unit length. A runner is lonely at time t if she is at a distance of at least 1/2 from the other runner at time t. If both runners all start at the same place at t=0, show that the runners will both be lonely at time t=\frac{1}{2(v_1-v_0)}.

The unsolved problem: Eight runners, each running at a speed different from that of the others, run along a circular track of unit length. A runner is lonely at time t if she is at a distance of at least 1/8 from every other runner at time t. If the runners all start at the same place at t=0, show that each of the eight runners will be lonely at some time.


Background and context: This problem is known as the lonely runner conjecture and was proposed by mathematician J.M Wills in 1967. It has been proved for up to seven runners, albeit with lengthy arguments that involve lots of case checking. It is currently unsolved for eight or more runners.

Eighth grade

Relevant common core: “Know and apply the properties of integer exponents”.

Warm up: Show that 3^{3n}+[2(3^n)]^3 = 3^{3n+2} for any integer greater than or equal to 1.

The unsolved problem: If A^x+B^y=C^z where A,B,C,x,y and are positive integers with x,y,z>2 then A,B and C have a common prime factor.

Background and context: This problem is known as Beal’s conjecture (named after the billionaire Andrew Beal). The famous “Fermat’s Last Theorem“, proved via the modularity theorem for semistable elliptic curves,  is a special case of this conjecture, an observation that hints at the difficulty of the conjecture. Andrew Beal is currently offering a prize of $1 million for its solution.

High School: Number and Quantity

Relevant common core: “In high school, students will be exposed to yet another extension of number, when the real numbers are augmented by the imaginary numbers to form the complex numbers.”

Warm up: Gaussian integers are numbers of the form a+bi where a and b are integers. The norm of a Gaussian integer a+bi is a^2+b^2. A Gaussian prime is a Gaussian integer that cannot be factored into Gaussian integers of smaller norm. Show that 4+i is a Gaussian prime.

The unsolved problem: Consider a circle in R2 with centre at the origin and radius r \geq 0. How many points there are inside this circle of the form (m,n) where m and n are both integers. In particular, denoting this number by N(r), find bounds on E(r) where N(r) = \pi r^2 + E(r).


Background and context: The problem is known as Gauss’ circle problem, and while phrased in terms of integer points in the plane, it is equivalent to asking for the number of Gaussian integers with norm less than a given constant.

High School: Algebra

Relevant common core: “Solve systems of equations”.

Warm up: An Euler brick is a cuboid whose edges and face diagonals all have integer length:


Algebraically, an Euler brick requires finding integers a,b,c,d,e,f such that

a^2+b^2=d^2, a^2+c^2=e^2 and b^2+c^2=f^2. Verify that (a,b,c,d,e,f) = (85, 132, 720,157, 725, 732) is an Euler brick.

The unsolved problem: A perfect cuboid is an Euler brick whose space diagonal g (see below) also has integer length:


Is there a perfect cuboid?

Background and context: The existence of perfect cuboids requires solving the systems of equations for the Euler brick with the addition of the requirement that a^2+b^2+c^2=g^2 with g an integer. The solution of (non-linear) systems of equations in many unknowns is the subject matter of algebraic geometry, in which a bridge is developed between the algebra and a corresponding geometry. Generally speaking the equations are easiest to solve when the solutions can be complex, harder when they are required to be real numbers (real algebraic geometry) and hardest when they are constrained to be integers (Diophantine, or arithmetic algebraic geometry).

High School: Functions

Relevant common core: “Understand the concept of a function and use function notation”.

Warm up: Euler’s totient function denoted by \varphi(n) assigns to each positive integer n the number of positive integers that are less than or equal to n and relatively prime to n. What is \varphi(p^k) when p is a prime number?

The unsolved problem: Is it true that for every n there is at least one other integer m \neq n with the property that \varphi(m) = \varphi(n)?

Background and context: The question is known as Carmichael’s conjecture, who posited that the answer is affirmative. Curiously, it has been proved (in The Distribution of Totients by Kevin Ford, 1998) that any counterexample to the conjecture must be larger than 10^{10^{10}}. Yet the problem is unsolved.

High School: Modeling

Relevant common core: “Modeling is the process of choosing and using appropriate mathematics and statistics to analyze empirical situations, to understand them better, and to improve decisions.”

Warm up: Read the biology paper The Biological Underpinnings of Namib Desert Fairy Circles and the mathematical modeling paper Adopting a spatially explicit perspective to study the mysterious fairy circles of Namibia (some basic calculus is required).

The unsolved problem: Develop a biologically motivated mathematical model that explains the fairy circles in Namibia.


Background and context: Fairy circles occur in Southern Africa, mostly in Namibia but also in South Africa. The circles are barren patches of land surrounded by vegetation, and they appear to go through life cycles of dozens of years. There have been many theories about them but despite a number of plausible models the phenomenon is poorly understood and their presence is considered “one of nature’s great mysteries“.

High School: Geometry

Relevant common core: “An understanding of the attributes and relationships of geometric objects can be applied in diverse contexts”.

Warm up: Calculate the area of the handset shape shown moving in the figure below. It consists of two quarter-circles of radius 1 on either side of a 1 by 4/π rectangle from which a semicircle of radius \scriptstyle 2/\pi\, has been removed.

The unsolved problem: What is the rigid two-dimensional shape of largest area that can be maneuvered through an L-shaped planar region (as shown above) with legs of unit width.

Background and context: This problem was posed by Leo Moser in 1966 and is known as the the moving sofa problem. It is known that the largest area for a sofa is between 2.2195 and 2.8284. The problem should be familiar to college age students who have had to manouever furniture in and out of dorm rooms.

High School: Statistics & Probability

Relevant common core: “Describe events as subsets of a sample space (the set of outcomes) using characteristics (or categories) of the outcomes, or as unions, intersections, or complements of other events (‘or,’ ‘and,’ ‘not’).”

Warm up: A family of sets is said to be union-closed if the union of any two sets from the family remains in the family. Write down five different examples of families of sets that are union-closed.

The unsolved problem: Show that for any finite union-closed family of finite sets, other than the family consisting only of the empty set, there exists an element that belongs to at least half of the sets in the family.

Background and context: The conjecture is known as Frankl’s conjecture, named after the mathematician Péter Frankl, and is also called the union closed sets conjecture. It is deceptively simple, but is known to be true only in a few very special cases.

When I was an undergraduate at Caltech I took a combinatorics course from Rick Wilson who taught from his then just published textbook A Course in Combinatorics (co-authored with J.H. van Lint). The course and the book emphasized design theory, a subject that is beautiful and fundamental to combinatorics, coding theory, and statistics, but that has sadly been in decline for some time. It was a fantastic course taught by a brilliant professor- an experience that had a profound impact on me. Though to be honest, I haven’t thought much about designs in recent years. Having kids changed that.

A few weeks ago I was playing the card game Colori with my three year old daughter. It’s one of her favorites.


The game consists of 15 cards, each displaying drawings of the same 15 items (beach ball, boat, butterfly, cap, car, drum, duck, fish, flower, kite, pencil, jersey, plane, teapot, teddy bear), with each item colored using two of the colors red, green, yellow and blue. Every pair of cards contains exactly one item that is colored exactly the same. For example, the two cards the teddy bear is holding in the picture above are shown below:


The only pair of items colored exactly the same are the two beach balls. The gameplay consists of shuffling the deck and then placing a pair of cards face-up. Players must find the matching pair, and the first player to do so keeps the cards. This is repeated seven times until there is only one card left in the deck, at which point the player with the most cards wins. When I play with my daughter “winning” consists of enjoying her laughter as she figures out the matching pair, and then proceeds to try to eat one of the cards.

An inspection of all 15 cards provided with the game reveals some interesting structure:


Every card contains exactly one of each type of item. Each item therefore occurs 15 times among the cards, with fourteen of the occurrences consisting of seven matched pairs, plus one extra. This is a type of partially balanced incomplete block design. Ignoring for a moment the extra item placed on each card, what we have is 15 items, each colored one of seven ways (v=15*7=105). The 105 items have been divided into 15 blocks (the cards), so that b=15. Each block contains 14 elements (the items) so that k=14, and each element appears in two blocks (r=2). If every pair of different (colored) items occurred in the same number of cards, we would have a balanced incomplete block design, but this is not the case in Colori. Each item occurs in the same block as 26 (=2*13) other items (we are ignoring the extra item that makes for 15 on each card), and therefore it is not the case that every pair of items occurs in the same number of blocks as would be the case in a balanced incomplete block design. Instead, there is an association scheme that provides extra structure among the 105 items, and in turn describes the way in which items do or do not appear together on cards. The association scheme can be understood as a graph whose nodes consist of the 105 items, with edges between items labeled either 0,1 or 2. An edge between two items of the same type is labeled 0, edges between different items that appear on the same card are labeled 1, and edges between different items that do not appear on the same card are labeled 2. This edge labeling is called an “association scheme” because it has a special property, namely the number of triangles with a base edge labeled k, and other two edges labeled i and respectively is  dependent only on i,j and k and not on the specific base edge selected. In other words, there is a special symmetry to the graph. Returning to the deck of cards, we see that every pair of items appears in the same card exactly 0 or 1 times, and the number depends only on the association class of the pair of objects. This is called a partially balanced incomplete block design.

The author of the game, Reinhard Staupe, made it a bit more difficult by adding an extra item to each card making the identification of the matching pair harder. The addition also ensures that each of the 15 items appears on each card. Moreover, the items are permuted in location on the cards, in an arrangement similar to a latin square, making it hard to pair up the items. And instead of using 8 different colors, he used only four, producing the eight different “colors” of each item on the cards by using pairwise combinations of the four.  The yellow-green two-colored beach balls are particularly difficult to tell apart from the green-yellow ones. Of course, much of this is exactly the kind of thing you would want to do if you were designing an RNA-Seq experiment!

Instead of 15 types of items, think of 15 different strains of mice.  Instead of colors for the items, think of different cellular conditions to be assayed. Instead of one pair for each of seven color combinations, think of one pair of replicates for each of seven cellular conditions. Instead of cards, think of different sequencing centers that will prepare the libraries and sequence the reads. An ideal experimental setup would involve distributing the replicates and different cellular conditions across the different sequencing centers so as to reduce batch effect. This is the essence of part of the paper Statistical Design and Analysis of RNA Sequencing Data by Paul Auer and Rebecca Doerge. For example, in their Figure 4 (shown below) they illustrate the advantage of balanced block designs to ameliorate lane effects:


Figure 4 from P. Auer and R.W. Doerge’s paper Statistical Design and Analysis of RNA Sequencing Data.

Of course the use of experimental designs for constructing controlled gene expression experiments is not new. Kerr and Churchill wrote about the use of combinatorial designs in Experimental Design for gene expression microarrays, and one can trace back a long chain of ideas originating with R.A. Fisher. But design theory seems to me to be a waning art insofar as molecular biology experiments are concerned, and it is frequently being replaced with biological intuition of what makes for a good control. The design of good controls is sometimes obvious, but not always. So next time you design an experiment, if you have young kids, first play a round of Colori. If the kids are older, play Set instead. And if you don’t have any kids, plan for an extra research project, because what else would you do with your time?

In the Jeopardy! game show contestants are presented with questions formulated as answers that require answers in the form questions. For example, if a contestant selects “Normality for $200” she might be shown the following clue:

“The average \frac{x_1+x_2+\cdots + x_n}{n},”

to which she would reply “What is the maximum likelihood estimate for the mean of independent identically distributed Gaussian random variables from which samples x_1,x_2,\ldots,x_n have been obtained?” Host Alex Trebek would immediately exclaim “That is the correct answer for $200!”

The process of doing mathematics involves repeatedly playing Jeopardy! with oneself in an unending quest to understand everything just a little bit better. The purpose of this blog post is to provide an exposition of how this works for understanding principal component analysis (PCA): I present four Jeopardy clues in the “Normality” category that all share the same answer: “What is principal component analysis?” The post was motivated by a conversation I recently had with a well-known population geneticist at a conference I was attending. I mentioned to him that I would be saying something about PCA in my talk, and that he might find what I have to say interesting because I knew he had used the method in many of his papers. Without hesitation he replied that he was well aware that PCA was not a statistical method and merely a heuristic visualization tool.

The problem, of course, is that PCA does have a statistical interpretation and is not at all an ad-hoc heuristic. Unfortunately, the previously mentioned population geneticist is not alone; there is a lot of confusion about what PCA is really about. For example, in one textbook it is stated that “PCA is not a statistical method to infer parameters or test hypotheses. Instead, it provides a method to reduce a complex dataset to lower dimension to reveal sometimes hidden, simplified structure that often underlie it.” In another one finds out that “PCA is a statistical method routinely used to analyze interrelationships among large numbers of objects.” In a highly cited review on gene expression analysis PCA is described as “more useful as a visualization technique than as an analytical method” but then in a paper by Markus Ringnér  titled the same as this post, i.e. What is principal component analysis? in Nature Biotechnology, 2008, the author writes that “Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set” (the author then avoids going into the details because “understanding the details underlying PCA requires knowledge of linear algebra”). All of these statements are both correct and incorrect and confusing. A major issue is that the description by Ringnér of PCA in terms of the procedure for computing it (singular value decomposition) is common and unfortunately does not shed light on when it should be used. But knowing when to use a method is far more important than knowing how to do it.

I therefore offer four Jeopardy! clues for principal component analysis that I think help to understand both when and how to use the method:

1. An affine subspace closest to a set of points.

Suppose we are given numbers x_1,\ldots,x_n as in the initial example above. We are interested in finding the “closest” number to these numbers. By “closest” we mean in the sense of total squared difference. That is, we are looking for a number m such that \sum_{i=1}^n (m-x_i)^2 is minimized.

This is a (straightforward) calculus problem, solved by taking the derivative of the function above and setting it equal to zero. If we let f(m) = \sum_{i=1}^n (m-x_i)^2 then f'(m) = 2 \cdot \sum_{i=1}^n (m-x_i) and setting f'(m)=0 we can solve for m to obtain m = \frac{1}{n} \sum_{i=1}^n x_i.

The right hand side of the equation is just the average of the n numbers and the optimization problem provides an interpretation of it as the number minimizing the total squared difference with the given numbers (note that one can replace squared difference by absolute value, i.e. minimization of \sum_{i=1}^n |m-x_i|, in which case the solution for is the median; we return to this point and its implications for PCA later).

Suppose that instead of n numbers, one is given n points in \mathbb{R}^p. That is, point is {\bf x}^i = (x^i_1,\ldots,x^i_p). We can now ask for a point {\bf m} with the property that the squared distance of {\bf m} to the n points is minimized. This is asking for min_{\bf m} \sum_{i=1}^n ||{\bf m}-{\bf x}^i||_2.

The solution for m can be obtained by minimizing each coordinate independently, thereby reducing the problem to the simpler version of numbers above, and it follows that {\bf m} = \frac{1}{n} \sum_{i=1}^n {\bf x}^i.

This is 0-dimensional PCA, i.e., PCA of a set of points onto a single point, and it is the centroid of the points. The generalization of this concept provides a definition for PCA:

Definition: Given n points in \mathbb{R}^p, principal components analysis consists of choosing a dimension k < p and then finding the affine space of dimension with the property that the squared distance of the points to their orthogonal projection onto the space is minimized.

This definition can be thought of as a generalization of the centroid (or average) of the points. To understand this generalization, it is useful to think of the simplest case that is not 0-dimensional PCA, namely 1-dimensional PCA of a set of points in two dimensions:

PCA_Figure1In this case the 1-dimensional PCA subspace can be thought of as the line that best represents the average of the points. The blue points are the orthogonal projections of the points onto the “average line” (see, e.g., the red point projected orthogonally), which minimizes the squared lengths of the dashed lines. In higher dimensions line is replaced by affine subspace, and the orthogonal projections are to points on that subspace. There are a few properties of the PCA affine subspaces that are worth noting:

  1. The set of PCA subspaces (translated to the origin) form a flagThis means that the PCA subspace of dimension k is contained in the PCA subspace of dimension k+1. For example, all PCA subspaces contain the centroid of the points (in the figure above the centroid is the green point). This follows from the fact that the PCA subspaces can be incrementally constructed by building a basis from eigenvectors of a single matrix, a point we will return to later.
  2. The PCA subspaces are not scale invariant. For example, if the points are scaled by multiplying one of the coordinates by a constant, then the PCA subspaces change. This is obvious because the centroid of the points will change. For this reason, when PCA is applied to data obtained from heterogeneous measurements, the units matter. One can form a “common” set of units by scaling the values in each coordinate to have the same variance.
  3. If the data points are represented in matrix form as an n \times p matrix X, and the points orthogonally projected onto the PCA subspace of dimension are represented as in the ambient p dimensional space by a matrix \tilde{X}, then \tilde{X} = argmin_{M:rk(M)=k} ||X-M||_2. That is, \tilde{X} is the matrix of rank k with the property that the Frobenius norm ||X-\tilde{X}||_2 is minimized. This is just a rephrasing in linear algebra of the definition of PCA given above.

At this point it is useful to mention some terminology confusion associated with PCA. Unfortunately there is no standard for describing the various parts of an analysis. What I have called the “PCA subspaces” are also sometimes called “principal axes”. The orthogonal vectors forming the flag mentioned above are called “weight vectors”, or “loadings”. Sometimes they are called “principal components”, although that term is sometimes used to refer to points projected onto a principal axis. In this post I stick to “PCA subspaces” and “PCA points” to avoid confusion.

Returning to Jeopardy!, we have “Normality for $400” with the answer “An affine subspace closest to a set of points” and the question “What is PCA?”. One question at this point is why the Jeopardy! question just asked is in the category “Normality”. After all, the normal distribution does not seem to be related to the optimization problem just discussed. The connection is as follows:

2. A generalization of linear regression in which the Gaussian noise is isotropic.

PCA has an interpretation as the maximum likelihood parameter of a linear Gaussian model, a point that is crucial in understanding the scope of its application. To explain this point of view, we begin by elaborating on the opening Jeopardy! question about Normality for $200:

The point of the question was that the average of n numbers can be interpreted as a maximum likelihood estimation of the mean of a Gaussian. The Gaussian distribution is

f(x,\mu,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}. Given the numbers x_1,\ldots,x_n, the likelihood function is therefore

L(\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}. The maximum of this function is the same as the maximum of its logarithm, which is

log L(\mu,\sigma) = \sum_{i=1}^n \left( log \frac{1}{\sqrt{2 \pi \sigma^2}} -\frac{(x_i-\mu)^2}{2\sigma^2} \right). Therefore the problem of finding the maximum likelihood estimate for the mean is equivalent to that of finding the minimum of the function

S(\mu) = \sum_{i=1}^n (x_i-\mu)^2. This is exactly the optimization problem solved by 0-dimensional PCA, as we saw above. With this calculation at hand, we turn to the statistical interpretation of least squares:


Given n points \{(x_i,y_i)\}_{i=1}^n in the plane (see figure above),  the least squares line y=mx+b (purple in figure) is the one that minimizes the sum of the squares \sum_{i=1}^n \left( (mx_i+b) - y_i \right)^2. That is, the least squares line is the one minimizing the sum of the squared vertical distances to the points. As with the average of numbers, the least squares line has a statistical interpretation: Suppose that there is some line y=m^{*}x+b^{*} (black line in figure) that is unknown, but that “generated” the observed points, in the sense that each observed point was obtained by perturbing the point m^{*}x_i +b^{*} vertically by a random amount from a single Gaussian distribution with mean 0 and variance \sigma^2. In the figure, an example is shown where the blue point on the unknown line “generates” the observed red point; the Gaussian is indicated with the blue streak around the point. Note that the model specified so far is not fully generative, as it depends on the hidden points m^{*}x_i +b^{*} and there is no procedure given to generate the x_i. This can be done by positing that the x_i are generated from a Gaussian distribution along the line y=m^{*}x+b (followed by the points y_i generated by Gaussian perturbation of the y coordinate on the line). The coordinates x_i can then be deduced directly from the observed points as the Gaussian perturbations are all vertical. The relationship between the statistical model just described and least squares is made precise by a theorem (which we state informally, but is a special case of the Gauss-Markov theorem):

Theorem (Gauss-Markov): The maximum likelihood estimate for the line (the parameters and b) in the model described above correspond to the least squares line.

The proof is analogous to the argument given for the average of numbers above so we omit it. It can be generalized to higher dimensions where it forms the basis of what is known as linear regression. In regression, the x_i are known as independent variables and y the dependent variable. The generative model provides an interpretation of the independent variables as fixed measured quantities, whereas the dependent variable is a linear combination of the independent variables with added noise. It is important to note that the origins of linear regression are in physics, specifically in work of Legendre (1805) and Gauss (1809) who applied least squares to the astronomical problem of calculating the orbits of comets around the sun. In their application, the independent variables were time (for which accurate measurements were possible with clocks; by 1800 clocks were accurate to less than 0.15 seconds per day) and the (noisy) dependent variable the measurement of location. Linear regression has become one of the most (if not the most) widely used statistical tools but as we now explain, PCA (and its generalization factor analysis), with a statistical interpretation that includes noise in the x_i variables, seems better suited for biological data.

The statistical interpretation of least squares can be extended to a similar framework for PCA. Recall that we first considered a statistical interpretation for least squares where an unknown line y=m^{*}x+b^{*} “generated” the observed points, in the sense that each observed point was obtained by perturbing the point m^{*}x_i +b^{*} vertically by a random amount from a single Gaussian distribution with mean 0 and variance \sigma^2. PCA can be understood analogously by replacing vertically by orthogonally (this is the probabilistic model of Collins et al., NIPS 2001 for PCA). However this approach is not completely satisfactory as the orthogonality of the perturbation is is not readily interpretable. Stated differently, it is not obvious what physical processes would generate points orthogonal to a linear affine subspace by perturbations that are always orthogonal to the subspace. In the case of least squares, the “vertical” perturbation corresponds to noise in one measurement (represented by one coordinate). The problem is in naturally interpreting orthogonal perturbations in terms of a noise model for measurements. This difficulty is resolved by a model called probabilistic PCA (pPCA), first proposed by Tipping and Bishop in a Tech Report in 1997, and published in the J. of the Royal Statistical Society B 2002,  and independently by Sam Roweis, NIPS 1998, that is illustrated visually in the figure below, and that we now explain:



In the pPCA model there is an (unknown) line (affine space in higher dimension) on which (hidden) points (blue) are generated at random according to a Gaussian distribution (represented by gray streak in the figure above, where the mean of the Gaussian is the green point). Observed points (red) are then generated from the hidden points by addition of isotropic Gaussian noise (blue smear), meaning that the Gaussian has a diagonal covariance matrix with equal entries. Formally, in the notation of Tipping and Bishop, this is a linear Gaussian model described as follows:

Observed random variables are given by t = Wx + \mu + \epsilon where x are latent (hidden) random variables, W is a matrix describing a subspace and Wx+\mu are the latent points on an affine subspace (\mu corresponds to a translation). Finally, \epsilon is an error term, given by a Gaussian random variable with mean 0 and covariance matrix \sigma^2 I. The parameters of the model are W,\mu and \sigma^2. Equivalently, the observed random variables are themselves Gaussian, described by the distribution t \sim \mathcal{N}(\mu,WW^T + \psi) where \psi \sim \mathcal{N}(0,\sigma^2I). Tipping and Bishop prove an analogy of the Gauss-Markov theorem, namely that the affine subspace given by the maximum likelihood estimates of W and \mu is the PCA subspace (the proof is not difficult but I omit it and refer interested readers to their paper, or Bishop’s Pattern Recognition and Machine Learning book).

It is important to note that although the maximum likelihood estimates of W,\mu in the pPCA model correspond to the PCA subspace, only posterior distributions can be obtained for the latent data (points on the subspace). Neither the mode nor the mean of those distributions corresponds to the PCA points (orthogonal projections of the observations onto the subspace). However what is true, is that the posterior distributions converge to the PCA points as \sigma^2 \rightarrow 0. In other words, the relationship between pPCA and PCA is a bit more subtle than that between least squares and regression.

The relationship between regression and (p)PCA is shown in the figure below:

PCA_Figure4In the figure, points have been generated randomly according to the pPCA model. the black smear shows the affine space on which the points were generated, with the smear indicating the Gaussian distribution used. Subsequently the latent points (light blue on the gray line) were used to make observed points (red) by the addition of isotropic Gaussian noise. The green line is the maximum likelihood estimate for the space, or equivalently by the theorem of Tipping and Bishop the PCA subspace. The projection of the observed points onto the PCA subspace (blue) are the PCA points. The purple line is the least squares line, or equivalently the affine space obtained by regression (y observed as a noisy function of x). The pink line is also a regression line, except where is observed as a noisy function of y.

A natural question to ask is why the probabilistic interpretation of PCA (pPCA) is useful or necessary? One reason it is beneficial is that maximum likelihood inference for pPCA involves hidden random variables, and therefore the EM algorithm immediately comes to mind as a solution (the strategy was suggested by both Tipping & Bishop and Roweis). I have not yet discussed how to find the PCA subspace, and the EM algorithm provides an intuitive and direct way to see how it can be done, without the need for writing down any linear algebra:


The exact version of the EM shown above is due to Roweis. In it, one begins with a random affine subspace passing through the centroid of the points. The “E” step (expectation) consists of projecting the points to the subspace. The projected points are considered fixed to the subspace. The “M” step (maximization) then consists of rotating the space so that the total squared distance of the fixed points on the subspace to the observed points is minimized. This is repeated until convergence. Roweis points out that this approach to finding the PCA subspace is equivalent to power iteration for (efficiently) finding eigenvalues of the the sample covariance matrix without computing it directly. This is our first use of the word eigenvalue in describing PCA, and we elaborate on it, and the linear algebra of computing PCA subspaces later in the post.

Another point of note is that pPCA can be viewed as a special case of factor analysis, and this connection provides an immediate starting point for thinking about generalizations of PCA. Specifically, factor analysis corresponds to the model t \sim \mathcal{N}(\mu,WW^T + \psi) where the covariance matrix \psi is less constrained, and only required to be diagonal. This is connected to a comment made above about when the PCA subspace might be more useful as a linear fit to data than regression. To reiterate, unlike physics, where some coordinate measurements have very little noise in comparison to others, biological measurements are frequently noisy in all coordinates. In such settings factor analysis is preferable, as the variance in each coordinate is estimated as part of the model. PCA is perhaps a good compromise, as PCA subspaces are easier to find than parameters for factor analysis, yet PCA, via its pPCA interpretation, accounts for noise in all coordinates.

A final comment about pPCA is that it provides a natural framework for thinking about hypothesis testing. The book Statistical Methods: A Geometric Approach by Saville and Wood is essentially about (the geometry of) pPCA and its connection to hypothesis testing. The authors do not use the term pPCA but their starting point is exactly the linear Gaussian model of Tipping and Bishop. The idea is to consider single samples from n independent identically distributed independent Gaussian random variables as one single sample from a high-dimensional multivariate linear Gaussian model with isotropic noise. From that point of view pPCA provides an interpretation for Bessel’s correction. The details are interesting but tangential to our focus on PCA.

We are therefore ready to return to Jeopardy!, where we have “Normality for $600” with the answer “A generalization of linear regression in which the Gaussian noise is isotropic” and the question “What is PCA?”

3. An orthogonal projection of points onto an affine space that maximizes the retained sample variance.

In the previous two interpretations of PCA, the focus was on the PCA affine subspace. However in many uses of PCA the output of interest is the projection of the given points onto the PCA affine space. The projected points have three useful related interpretations:

  1. As seen in in section 1, the (orthogonally) projected points (red -> blue) are those whose total squared distance to the observed points is minimized.
  2. What we focus on in this section, is the interpretation that the PCA subspace is the one onto which the (orthogonally) projected points maximize the retained sample variance.
  3. The topic of the next section, namely that the squared distances between the (orthogonally) projected points are on average (in the l_2 metric) closest to the original distances between the points.



The sample variance of a set of points is the average squared distance from each point to the centroid. Mathematically, if the observed points are translated so that their centroid is at zero (known as zero-centering), and then represented by an n \times p matrix X, then the sample covariance matrix is given by \frac{1}{n-1}X^tX and the sample variance is given by the trace of the matrix. The point is that the jth diagonal entry of \frac{1}{n-1}X^tX is just \frac{1}{n-1}\sum_{i=1}^n (x^i_j)^2, which is the sample variance of the jth variable. The PCA subspace can be viewed as that subspace with the property that the sample variance of the projections of the observed points onto the subspace is maximized. This is easy to see from the figure above. For each point (blue), Pythagoras’ theorem implies that d(red,blue)^2+d(blue,green)^2 = d(red,green)^2. Since the PCA subspace is the one minimizing the total squared red-blue distances, and since the solid black lines (red-green distances) are fixed, it follows that the PCA subspace also maximizes the total squared green-blue distances. In other words, PCA maximizes the retained sample variance.

The explanation above is informal, and uses a 1-dimensional PCA subspace in dimension 2 to make the argument. However the argument extends easily to higher dimension, which is typically the setting where PCA is used. In fact, PCA is typically used to “visualize” high dimensional points by projection into dimensions two or three, precisely because of the interpretation provided above, namely that it retains the sample variance. I put visualize in quotes because intuition in two or three dimensions does not always hold in high dimensions. However PCA can be useful for visualization, and one of my favorite examples is the evidence for genes mirroring geography in humans. This was first alluded to by Cavalli-Sforza, but definitively shown by Lao et al., 2008, who analyzed 2541 individuals and showed that PCA of the SNP matrix (approximately) recapitulates geography:


Genes mirror geography from Lao et al. 2008: (Left) PCA of the SNP matrix (2541 individuals x 309,790 SNPs) showing a density map of projected points. (Right) Map of Europe showing locations of the populations .

In the picture above, it is useful to keep in mind that the emergence of geography is occurring in that projection in which the sample variance is maximized. As far as interpretation goes, it is useful to look back at Cavalli-Sforza’s work. He and collaborators who worked on the problem in the 1970s, were unable to obtain a dense SNP matrix due to limited technology of the time. Instead, in Menozzi et al., 1978 they performed PCA of an allele-frequency matrix, i.e. a matrix indexed by populations and allele frequencies instead of individuals and genotypes. Unfortunately they fell into the trap of misinterpreting the biological meaning of the eigenvectors in PCA. Specifically, they inferred migration patterns from contour plots in geographic space obtained by plotting the relative contributions from the eigenvectors, but the effects they observed turned out to be an artifact of PCA. However as we discussed above, PCA can be used quantitatively via the stochastic process for which it solves maximum likelihood inference. It just has to be properly understood.

To conclude this section in Jeopardy! language, we have “Normality for $800” with the answer “A set of points in an affine space obtained via projection of a set of given points so that the sample variance of the projected points is maximized” and the question “What is PCA?”

4. Principal component analysis of Euclidean distance matrices.

In the preceding interpretations of PCA, I have focused on what happens to individual points when projected to a lower dimensional subspace, but it is also interesting to consider what happens to pairs of points. One thing that is clear is that if a pair of points are projected orthogonally onto a low-dimensional affine subspace then the distance between the points in the projection is smaller than the original distance between the points. This is clear because of Pythagoras’ theorem, which implies that the squared distance will shrink unless the points are parallel to the subspace in which case the distance remains the same. An interesting observation is that in fact the PCA subspace is the one with the property where the average (or total) squared distances between the points is maximized. To see this it again suffices to consider only projections onto one dimension (the general case follows by Pythagoras’ theorem). The following lemma, discussed in my previous blog post, makes the connection to the previous discussion:

Lemma: Let x_1,\ldots,x_n be numbers with mean \overline{x} = \frac{1}{n}\sum_i x_i. If the average squared distance between pairs of points is denoted D = \frac{1}{n^2}\sum_{i,j} (x_i-x_j)^2 and the variance is denoted V=\frac{1}{n}\sum_i (x_i-\overline{x})^2 then V=\frac{1}{2}D.

What the lemma says is that the sample variance is equal to the average squared difference between the numbers (i.e. it is a scalar multiple that does not depend on the numbers). I have already discussed that the PCA subspace maximizes the retained variance, and it therefore follows that it also maximizes the average (or total) projected squared distance between the points. Alternately, PCA can be interpreted as minimizing the total (squared) distance that is lost, i,e. if the original distances between the points are given by a distance matrix D and the projected distances are given by \tilde{D}, then the PCA subspace minimizes \sum_{ij} (D^2_{ij} - \tilde{D}^2_{ij}), where each term in the sum is positive as discussed above.

This interpretation of PCA leads to an interesting application of the method to (Euclidean) distance matrices rather than points. The idea is based on a theorem of Isaac Schoenberg that characterizes Euclidean distance matrices and provides a method for realizing them. The theorem is well-known to structural biologists who work with NMR, because it is one of the foundations used to reconstruct coordinates of structures from distance measurements. It requires a bit of notation: is a distance matrix with entries d_{ij} and \Delta is the matrix with entries -\frac{1}{2}d^2_{ij}. {\bf 1} denotes the vector of all ones, and {\bf s} denotes a vector.

Theorem (Schoenberg, 1938): A matrix D is a Euclidean distance matrix if and only if the matrix B=(I-{\bf 1}{\bf s}')\Delta(I-{\bf s}{\bf 1}') is positive semi-definite where {\bf s}'{\bf 1} = 1.

For the case when {\bf s} is chosen to be a unit vector, i.e. all entries are zero except one of them equal to 1, the matrix B can be viewed as the Gromov transform (known as the Farris transform in phylogenetics) of the matrix with entries d^2_{ij}. Since the matrix is positive semidefinite it can be written as B=XX^t, where the matrix X provides coordinates for points that realize D. At this point PCA can be applied resulting in a principal subspace and points on it (the orthogonal projections of X). A point of note is that eigenvectors of XX^t can be computed directly, avoiding the need to compute X^tX which may be a larger matrix if n < p.

The procedure just described is called classic multidimensional scaling (MDS) and it returns a set of points on a Euclidean subspace with distance matrix \tilde{D} that best represent the original distance matrix D in the sense that \sum_{ij} (D^2_{ij} - \tilde{D}^2_{ij}) is minimized. The term multidimensional scaling without the “classic” has taken on an expanded meaning, namely it encapsulates all methods that seek to approximately realize a distance matrix by points in a low dimensional Euclidean space. Such methods are generally not related to PCA, but classic multidimensional scaling is PCA. This is a general source of confusion and error on the internet. In fact, most articles and course notes I found online describing the connection between MDS and PCA are incorrect. In any case classic multidimensional scaling is a very useful instance of PCA, because it extends the utility of the method to cases where points are not available but distances between them are.

Now we return to Jeopardy! one final time with the final question in the category: “Normality for $1000”. The answer is “Principal component analysis of Euclidean distance matrices” and the question is “What is classic multidimensional scaling?”

An example

To illustrate the interpretations of PCA I have highlighted, I’m including an example in R inspired by an example from another blog post (all commands can be directly pasted into an R console). I’m also providing the example because missing in the discussion above is a description of how to compute PCA subspaces and the projections of points onto them. I therefore explain some of this math in the course of working out the example:

First, I generate a set of points (in \mathbb{R}^2). I’ve chosen a low dimension so that pictures can be drawn that are compatible with some of the examples above. Comments following commands appear after the # character.

 set.seed(2)             #sets the seed for random number generation.
 x <- 1:100              #creates a vector x with numbers from 1 to 100
 ex <- rnorm(100, 0, 30) #100 normally distributed rand. nos. w/ mean=0, s.d.=30
 ey <- rnorm(100, 0, 30) # " " 
 y <- 30 + 2 * x         #sets y to be a vector that is a linear function of x
 x_obs <- x + ex         #adds "noise" to x
 y_obs <- y + ey         #adds "noise" to y
 P <- cbind(x_obs,y_obs) #places points in matrix
 plot(P,asp=1,col=1) #plot points
 points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center

At this point a full PCA analysis can be undertaken in R using the command “prcomp”, but in order to illustrate the algorithm I show all the steps below:

 M <- cbind(x_obs-mean(x_obs),y_obs-mean(y_obs))#centered matrix
 MCov <- cov(M)          #creates covariance matrix

Note that the covariance matrix is proportional to the matrix $M^tM$. Next I turn to computation of the principal axes:

 eigenValues <- eigen(MCov)$values       #compute eigenvalues
 eigenVectors <- eigen(MCov)$vectors     #compute eigenvectors

The eigenvectors of the covariance matrix provide the principal axes, and the eigenvalues quantify the fraction of variance explained in each component. This math is explained in many papers and books so we omit it here, except to say that the fact that eigenvalues of the sample covariance matrix are the principal axes follows from recasting the PCA optimization problem as maximization of the Raleigh quotient. A key point is that although I’ve computed the sample covariance matrix explicitly in this example, it is not necessary to do so in practice in order to obtain its eigenvectors. In fact, it is inadvisable to do so. Instead, it is computationally more efficient, and also more stable to directly compute the singular value decomposition of M. The singular value decomposition of M decomposes it into M=UDV^t where is a diagonal matrix and both U and V^t are orthogonal matrices. I will also not explain in detail the linear algebra of singular value decomposition and its relationship to eigenvectors of the sample covariance matrix (there is plenty of material elsewhere), and only show how to compute it in R:

 d <- svd(M)$d          #the singular values
 v <- svd(M)$v          #the right singular vectors

The right singular vectors are the eigenvectors of M^tM.  Next I plot the principal axes:


This shows the first principal axis. Note that it passes through the mean as expected. The ratio of the eigenvectors gives the slope of the axis. Next


shows the second principal axis, which is orthogonal to the first (recall that the matrix V^t in the singular value decomposition is orthogonal). This can be checked by noting that the second principal axis is also


as the product of orthogonal slopes is -1. Next, I plot the projections of the points onto the first principal component:

 trans <- (M%*%v[,1])%*%v[,1] #compute projections of points
 P_proj <- scale(trans, center=-cbind(mean(x_obs),mean(y_obs)), scale=FALSE) 
 points(P_proj, col=4,pch=19,cex=0.5) #plot projections
 segments(x_obs,y_obs,P_proj[,1],P_proj[,2],col=4,lty=2) #connect to points

The linear algebra of the projection is simply a rotation followed by a projection (and an extra step to recenter to the coordinates of the original points). Formally, the matrix M of points is rotated by the matrix of eigenvectors W to produce T=MW. This is the rotation that has all the optimality properties described above. The matrix T is sometimes called the PCA score matrix. All of the above code produces the following figure, which should be compared to those shown above:


There are many generalizations and modifications to PCA that go far beyond what has been presented here. The first step in generalizing probabilistic PCA is factor analysis, which includes estimation of variance parameters in each coordinate. Since it is rare that “noise” in data will be the same in each coordinate, factor analysis is almost always a better idea than PCA (although the numerical algorithms are more complicated). In other words, I just explained PCA in detail, now I’m saying don’t use it! There are other aspects that have been generalized and extended. For example the Gaussian assumption can be relaxed to other members of the exponential family, an important idea if the data is discrete (as in genetics). Yang et al. 2012 exploit this idea by replacing PCA with logistic PCA for analysis of genotypes. There are also many constrained and regularized versions of PCA, all improving on the basic algorithm to deal with numerous issues and difficulties. Perhaps more importantly, there are issues in using PCA that I have not discussed. A big one is how to choose the PCA dimension to project to in analysis of high-dimensional data. But I am stopping here as I am certain no one is reading at this far into the post anyway…

The take-home message about PCA? Always be thinking when using it!

Acknowledgment: The exposition of PCA in this post began with notes I compiled for my course MCB/Math 239: 14 Lessons in Computational Genomics taught in the Spring of 2013. I thank students in that class for their questions and feedback. None of the material presented in class was new, but the exposition was intended to clarify when PCA ought to be used, and how. I was inspired by the papers of Tipping, Bishop and Roweis on probabilistic PCA in the late 1990s that provided the needed statistical framework for its understanding. Following the class I taught, I benefited greatly from conversations with Nicolas Bray, Brielin Brown, Isaac Joseph and Shannon McCurdy who helped me to further frame PCA in the way presented in this post.

The Habsburg rulership of Spain ended with an inbreeding coefficient of F=0.254. The last king, Charles II (1661-1700), suffered an unenviable life. He was unable to chew. His tongue was so large he could not speak clearly, and he constantly drooled. Sadly, his mouth was the least of his problems. He suffered seizures, had intellectual disabilities, and was frequently vomiting. He was also impotent and infertile, which meant that even his death was a curse in that his lack of heirs led to a war.

None of these problems prevented him from being married (twice). His first wife, princess Henrietta of England, died at age 26 after becoming deeply depressed having being married to the man for a decade. Only a year later, he married another princess, 23 year old Maria Anna of Neuberg. To put it mildly, his wives did not end up living the charmed life of Disney princesses, nor were they presumably smitten by young Charles II who apparently aged prematurely and looked the part of his horrific homozygosity. The princesses married Charles II because they were forced to. Royals organized marriages to protect and expand their power, money and influence. Coupled to this were primogeniture rules which ensured that the sons of kings, their own flesh and blood and therefore presumably the best-suited to be in power, would indeed have the opportunity to succeed their fathers. The family tree of Charles II shows how this worked in Spain:


It is believed that the inbreeding in Charles II’s family led to two genetic disorders, combined pituitary hormone deficiency and distal renal tubular acidosis, that explained many of his physical and mental problems. In other words, genetic diversity is important, and the point of this blog post is to highlight the fact that diversity is important in education as well.

The problem of inbreeding in academia has been studied previously, albeit to a limited extent. One interesting article is Navel Grazing: Academic Inbreeding and Scientific Productivity by Horta et al published in 2010 (my own experience with an inbred academic from a department where 39% of the faculty are self-hires anecdotally confirms the claims made in the paper). But here I focus on the downsides of inbreeding of ideas rather than of faculty. For example home-schooling, the educational equivalent of primogeniture, can be fantastic if the parents happen to be good teachers, but can fail miserably if they are not. One thing that is guaranteed in a school or university setting is that learning happens by exposure to many teachers (different faculty, students, tutors, the internet, etc.) Students frequently complain when there is high variance in teaching quality, but one thing such variance ensures is that is is very unlikely that any student is exposed only to bad teachers. Diversity in teaching also helps to foster the development of new ideas. Different teachers, by virtue of insight or error, will occasionally “mutate” ideas or concepts for better or for worse. In other words, one does not have to fully embrace the theory of memes to acknowledge that there are benefits to variance in teaching styles, methods and pedagogy. Conversely, there is danger in homogeneity.

This brings me to MOOCs. One of the great things about MOOCs is that they reach millions of people. Udacity claims it has 1.6 million “users” (students?). Coursera claims 7.1 million. These companies are greatly expanding the accessibility of education. Starving children in India can now take courses in mathematical methods for quantitative finance, and for the first time in history, a president of the United States can discreetly take a freshman course on economics together with its high school algebra prerequisites (highly recommended). But when I am asked whether I would be interested in offering a MOOC I hesitate, paralyzed at the thought that any error I make would immediately be embedded in the brains of millions of innocent victims. My concern is this: MOOCs can greatly reduce the variance in education. For example, Coursera currently offers 641 courses, which means that each courses is or has been taught to over 11,000 students. Many college courses may have less than a few dozen students, and even large college courses rarely have more than a few hundred students. This means that on average, through MOOCs, individual professors reach many more (2 orders of magnitude!) students. A great lecture can end up positively impacting a large number of individuals, but at the same time, a MOOC can be a vehicle for infecting the brains of millions of people with nonsense. If that nonsense is then propagated and reaffirmed via the interactions of the people who have learned it from the same source, then the inbreeding of ideas has occurred.

I mention MOOCs because I was recently thinking about intuition behind Bessel’s correction replacing n with n-1 in the formula for sample variance. Formally, Bessel’s correction replaces the biased formula

s^2_n = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2

for estimating the variance of a random variable from samples x_1,\ldots,x_n with

s^2_{n-1} = \frac{1}{n-1} \sum_{i=1}^n (x_i-\overline{x})^2.

The switch from to n-1 is a bit mysterious and surprising, and in introductory statistics classes it is frequently just presented as a “fact”. When an explanation is provided, it is usually in the form of algebraic manipulation that establishes the result. The issue came up as a result of a blog post I’m writing about principal components analysis (PCA), and I thought I would check for an intuitive explanation online. I googled “intuition sample variance” and the top link was a MOOC from the Khan Academy:

The video has over 51,000 views with over 100 “likes” and only 6 “dislikes”. Unfortunately, in this case, popularity is not a good proxy for quality. Despite the title promising “review” and “intuition” for “why we divide by n-1 for the unbiased sample variance” there is no specific reason given why is replaced by n-1 (as opposed to another correction). Furthermore, the intuition provided has to do with the fact that x_i-\overline{x} underestimates x_i-\mu (where \mu is the mean of the random variable and \overline{x} is the sample mean) but the explanation is confusing and not quantitative (which it can easily be). In fact, the wikipedia page for Bessel’s correction provides three different mathematical explanations for the correction together with the intuition that motivates them, but it is difficult to find with Google unless one knows that the correction is called “Bessel’s correction”.

Wikipedia is also not perfect, and this example is a good one for why teaching by humans is important. Among the three alternative derivations, I think that one stands out as “better” but one would not know by just looking at the wikipedia page. Specifically, I refer to “Alternate 1” on the wikipedia page, that is essentially explaining that variance can be rewritten as a double sum corresponding to the average squared distance between points and the diagonal terms of the sum are zero in expectation. An explanation of why this fact leads to the n-1 in the unbiased estimator is as follows:

The first step is to notice that the variance of a random variable is equal to half of the expected squared difference of two independent identically distributed random variables of that type. Specifically, the definition of variance is:

var(X) = \mathbb{E}(X - \mu)^2 where \mu = \mathbb{E}(X). Equivalently, var(X) = \mathbb{E}(X^2) -\mu^2. Now suppose that Y is another random variable identically distributed to X and with X,Y independent. Then \mathbb{E}(X-Y)^2 = 2 var(X). This is easy to see by using the fact that

\mathbb{E}(X-Y)^2 = \mathbb{E}(X^2) + \mathbb{E}(Y^2) - 2\mathbb{E}(X)\mathbb{E}(Y) = 2\mathbb{E}(X^2)-2\mu^2.

This identity motivates a rewriting of the (uncorrected) sample variance s_n in a way that is computationally less efficient, but mathematically more insightful:

s_n = \frac{1}{2n^2} \sum_{i,j=1}^n (x_i-x_j)^2.

Of note is that in this summation exactly n of the terms are zero, namely the terms when i=j. These terms are zero independently of the original distribution, and remain so in expectation thereby biasing the estimate of the variance, specifically leading to an underestimate. Removing them fixes the estimate and produces

s_{n-1}^2 = \frac{1}{2n(n-1)} \sum_{i,j=1, i \neq j}^n (x_i-x_j)^2.

It is easy to see that this is indeed Bessel’s correction. In other words, the correction boils down to the fact that n^2-n = n(n-1), hence the appearance of n-1.

Why do I like this particular derivation of Bessel’s correction? There are two reasons: first, n-1 emerges naturally and obviously from the derivation. The denominator in s_{n-1}^2 matches exactly the number of terms being summed, so that it can be understood as a true average (this is not apparent in its standard form as s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\overline{x})^2. There is really nothing mysterious anymore, its just that some terms having been omitted from the sum because they were non-inofrmative. Second, as I will show in my forthcoming blog post on PCA, the fact that the variance of a random variable is half of the expectation of the squared difference of two instances, is key to understanding the connection between multi-dimensional scaling (MDS) and PCA. In other words, as my student Nicolas Bray is fond of saying, although most people think a proof is either right or wrong, in fact some proofs are more right than others. The connection between Bessel’s correction and PCA goes even deeper: as explained by Saville and Wood in their book Statistical Methods: A Geometric Approach n-1 can be understood to be a reduction in one dimension from the point of view of probabilistic PCA (Saville and Wood do not explicitly use the term probabilistic PCA but as I will explain in my PCA post it is implicit in their book). Finally, there are many subtleties to Bessel’s correction, for example it is an unbiased estimator for variance and not standard deviation. These issues ought to be mentioned in a good lecture about the topic. In other words, the Khan lecture is neither necessary nor sufficient, but unlike a standard lecture where the damage is limited to a small audience of students, it has been viewed more than 50,000 times and those views cannot be unviewed.

In writing this blog post I pondered the irony of my call for added diversity in teaching while I preach my own idea (this post) to a large number of readers via a medium designed for maximal outreach. I can only ask that others blog as well to offer alternative points of view 🙂 and that readers inform themselves on the issues I raise by fact-checking elsewhere. As far as the statistics goes, if someone finds the post confusing, they should go and register for one of the many fantastic MOOCs on statistics! But I reiterate that in the rush to MOOCdom, providers must offer diversity in their offerings (even multiple lectures on the same topic) to ensure a healthy population of memes. This is especially true in Spain, where already inbred faculty are now inbreeding what they teach by MOOCing via Miriada X. Half of the MOOCs being offered in Spain originate from just 3 universities, while the number of potential viewers is enormous as Spanish is now the second most spoken language in the world (thanks to Charles II’s great-great-grandfather, Charles I).

May Charles II rest in peace.

On Satuday I submitted the final grades for Math10A, the new UC Berkeley freshman math class for intended biology majors that I taught this semester. In assigning students their grades, I had a chance to reflect again on the system we use and its substantial shortcomings.

The system is broken, and my grade assignment procedure illustrates why. Math 10a had 223 students this semester, and they were graded according to the following policy: homework 10%, quizzes 10%, midterms 20% each (there were two midterms) and the final 40%. If midterm 1 was skipped then midterm 2 counted 40%. Similarly, if midterm 2 was skipped then the final counted 60%. This produced a raw score for each student and the final distribution is shown below (zeroes not shown):


The distribution seems fairly “reasonable”. One student didn’t do any work or show up and got a 5/100. At the other end of the spectrum some students aced the class. The average score was 74.48 and  the standard deviation 15.06. An optimal raw score distribution should allow for detailed discrimination between students (e.g. if everyone gets the same score thats not helpful). I think my distribution could have been a bit better but I overall I am satisfied with it.  The problem comes with the next step: after obtaining raw scores in a class, the professor has to set cutoffs for A+/A/A-/B+/B/B-/C+/C/C-/D+/D/D-/F. Depending on how the cutoffs are set, the grade distribution can change dramatically. In fact, it is easy to see that any discrete distribution on letter grades is achievable from any raw score distribution. One approach to letter grades would be to fix an A at, say, any raw score greater than or equal 90%, i.e., no curving. I found that threshold on wikipedia. But that is rarely how grades are set, partly because of large variability in the difficulty of exams. Almost every professor I know “curves” to some extent. At Berkeley one can examine grade distributions here.

It turns out that Roger Purves from statistics used to aim for a uniform distribution:


Roger Purves’ Stat 2 grade distribution over the past 6 years.

The increase in C- grades is explained by an artifact of the grading system at Berkeley.  If a student fails the class they can take it again and record the passing grade for their GPA (although the F remains on the transcript). A grade of D is not only devastating for the GPA, but also permanent. It cannot be improved by retaking the class. Therefore many students try to fail when they are doing poorly in a class, and many professors simply avoid assigning Ds. In other words, Purves’ C- includes his Ds. Another issue is that an A+ vs. A does not affect GPA, but an A vs. A- does; the latter is obviously a very subjective difference that varies widely between classes and professors. Note that Roger Purves just didn’t assign A+ grades, presumably because they have no GPA significance (although they do arguably have a psychological impact).

Marina Ratner from math failed more students [Update November 9, 2014: Prof. Ratner has pointed out to me that she receives excellent reviews from students on Ratemyprofessors, while explaining that “the large number of F in my classes are due to the large number of students who drop the class but are still on the list or don’t do any work” and that “One of the reasons why my students learned and others did not was precisely because of my grading policy.”]. Her grade distribution for Math 1b in the Spring semester of 2009 is below:


Marina Ratner’s Math 1B, Spring 2009.

In the same semester, in a parallel section, her colleague Richard Borcherds gave the following grades:


Richard Borcherd’s Math 1B, Spring 2009.

Unlike Ratner, Borcherds appears to be averse to failing students. Only 7 students failed out of 441 who were enrolled in his two sections that semester. Fair?

And then there are those who believe in the exponential distribution, for example Heino Nitsche who teaches Chem 1A:


Heino Nitsche’s Chem 1A, Spring 2011.

The variability in grade assignment is astonishing. As can be seen above, curving is prevalent and arbitrary, and the idea that grades have an absolute meaning is not credible. It is statistically highly unlikely that Ratner’s students were always terrible at learning math (whereas Borcherds “luckily” got the good students). Is chemistry inherently easy, to the point where an average student taking the class deserves an A?

This messed up system is different, yet similar in other schools. Sadly, many schools have used letter grading to manipulate GPAs via continual grade inflation. Just three weeks ago on December 3rd, the dean of undergraduate education at Harvard confirmed that the median grade at Harvard is an A- and the most common grade an A. The reasons for grade inflation are manifold. But I can understand it on a personal level. It is tempting for a faculty member to assign As because those are likely to immediately translate to better course evaluations (both internal, and public on sites such as Ninja Courses and ratemyprofessor). Local grade inflation can quickly lead to global inflation as professors, and at a higher level their universities, are competing with each other for the happiest students.

How did I assign letter grades for Math 10A?

After grading the final exams together, my five GSIs started the process of setting letter grade thresholds by examining the grades of “yardstick students”. These were students for which the GSIs felt confident in declaring their absolute knowledge of the material to be at the A,B,C or F levels. We then proceeded to refine the thresholds adding +/- cutoffs by simultaneously trying to respect the yardsticks, while also keeping in mind the overall grade distribution. Finally, I asked the GSIs to consider moving students upward across thresholds if they had shown consistent improvement and commitment throughout the semester (a policy I had informed the students of in class). The result was that about 40% of the students ended with an A. Students failed the class at a threshold where we believed they had not learned enough of the material to proceed to math 10B. I have to say that as far as my job goes, assigning letter grades for courses is the least scientific endeavor I participate in.

What should be done?

Until recently grades were recorded on paper, making it difficult to perform anything but trivial computations on the raw scores or letter grades. But electronic recording of grades allows for more sophisticated analysis. This should be taken advantage of. Suppose that instead of a letter grade, each student’s raw scores were recorded, along with the distribution of class scores. A single letter would immediately be replaced by a meaningful number in context.

I do think it is unfair to grade students only relatively, especially with respect to cohorts that can range in quality. But it should be possible to compute a meaningful custom raw score distribution specific to individual students based on the classes they have taken. The raw data is a 3-way table whose dimensions consist of professors x classes x raw scores. This table is sparse, as professors typically only teach a handful of different courses throughout their career. But by properly averaging the needed distributions as gleaned from this table, it should be possible to produce for each student an overall GPA score, together with a variance of the (student specific) distribution it came from averaged over the courses the student took. The resulting distribution and score could be renormalized to produce a single meaningful number. That way, taking “difficult” classes with professors who grade harshly would not penalize the GPA. Similarly, aiming for easy As wouldn’t help the GPA. And manipulative grade inflation on the part of professors and institutions would be much more difficult.

Its time to level the playing field for students, eliminate the possibility for manipulative grade inflation, and to stop hypocrisy. We need to not only preach statistical and computational thinking to our students, we need to practice it in every aspect of their education.

[Update April 6, 2014: The initial title of this post was “23andme genotypes are all wrong”. While that was and remains a technically correct statement, I have changed it because the readership of my blog, and this post in particular, has changed. Initially, when I made this post, the readers of the blog were (computational) biologists with extensive knowledge of genotyping and association mapping, and they could understand the point I was trying to make with the title. However in the past few months the readership of my blog has grown greatly, and the post is now reaching a wide public audience. The revised title clarifies that the content of this post relates to the point that low error rates in genotyping can be problematic in the context of genome-wide association reports because of multiple-testing.]

I have been reading the flurry of news articles and blog posts written this week about 23andme and the FDA with some interest. In my research talks, I am fond of displaying 23andme results, and have found that people always respond with interest. On the teaching side, I have subsidized 23andme testing for volunteer students in Math127 who were interested in genetics so that they could learn about personalized genomics first-hand. Finally, a number of my former and current students have worked at 23andme, and some are current employees.

Despite lots of opinions being expressed about the 23andme vs. FDA kerfuffle, I believe that two key points have been ignored in the discussions:

  1. All 23andme genotypes that have ever been reported to customers are wrong. This is the case despite very accurate genotyping technology used by 23andme.
  2. The interpretation of 23andme results involves examining a large number of odds ratios. The presence of errors leads to a huge multiple-testing problem.

Together, these issues lead to an interesting conundrum for the company, for customers, and for the FDA.

I always find it useful to think about problems concretely. In the case of 23andme, it means examining actual genotypes. Fortunately, you don’t have to pay the company $99 dollars to get your own- numerous helpful volunteers have posted their 23andme genotypes online. They can be viewed at where “customers of direct-to-customer genetic tests [can] publish their test results, find others with similar genetic variations, learn more about their results, get the latest primary literature on their variations and help scientists find new associations”. There are a total of 624 genotypes available at openSNP, many of them from 23andme. As an example, consider “samantha“, who in addition to providing her 23andme genotype, also provides lots of phenotypic information. Here is the initial part of her genotype file:

# This data file generated by 23andMe at: Wed Jul 20 20:37:11 2011
# Below is a text version of your data. Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier 
# (an rsid or an internal id), its location on the reference human genome, and the 
# genotype call oriented with respect to the plus strand on the human reference 
# sequence.     We are using reference human assembly build 36.  Note that it is possible 
# that data downloaded at different times may be different due to ongoing improvements 
# in our ability to call genotypes. More information about these changes can be found at:
# More information on reference human assembly build 36:
# rsid	chromosome	position	genotype
rs4477212	1	72017	AA
rs3094315	1	742429	AG
rs3131972	1	742584	AG
rs12124819	1	766409	AA
rs11240777	1	788822	AA
rs6681049	1	789870	CC
rs4970383	1	828418	CC
rs4475691	1	836671	CC
rs7537756	1	844113	AA
rs13302982	1	851671	GG
rs1110052	1	863421	GT

Anyone who has been genotyped by 23andme can get this file for themselves from the website (by clicking on their name, then on “Browse Raw Data” from the pull-down menu, and then clicking on “Download” in the top-right corner of the browser window). The SNPs are labeled with rsid labels (e.g. rs3094315) and correspond to specific locations on chromosomes (e.g. chr1:742429). Since every human is diploid, two bases are shown for every SNP; one came from mom and one from dad. The 23andme genotype is not phased, which means that you can’t tell in the case of rs3094315 whether the A was from mom and the G from dad, or vice versa (it turns out paternal origin can be important, but that is a topic for another post).

A key question the FDA has asked, as it does for any diagnostic test, is whether the SNP calls are accurate. The answer is already out there. First, someone has performed a 23andme replicate experiment precisely to assess the error rate. In an experiment in 2010 with two replicates, 85 SNPs out of about 600,000 were different. Today, Illumina types around 1 million SNPs, so one would expect even more errors. Furthermore, a replicate analysis provides only a lower bound, since systematic errors will not be detected. Another way to examine the error rate is to look at genotypes of siblings. That was written about in this blog post which concluded there were 87 errors. 23andme currently uses the Illumina Omni Express for genotyping, and the Illumina spec sheet claims a similar error rate to those inferred in the blog posts mentioned above. The bottom line is that even though the error rate for any individual SNP call is very very low (<0.01% error), with a million SNPs being called there is (almost) certainly at least one error somewhere in the genotype. In fact, assuming a conservative error rate leading to an average of 100 errors per genotype, the probability that a 23andme genotype has no errors is less than 10^(-40).

The fact that 23andme genotypes are wrong (i.e. at least one error in some SNP) wouldn’t matter if one was only interested in a single SNP. With very high probability, it would be some other SNPs that are the wrong ones. But the way people use 23andme is not to look at a single SNP of interest, but rather to scan the results from all SNPs to find out whether there is some genetic variant with large (negative) effect. The good news is that there isn’t much information available for the majority of the 1 million SNPs being tested. But there are, nevertheless, lots of SNPs (thousands) to look at. Whereas a comprehensive exam at a doctor’s office might currently constitute a handful of tests– a dozen or a few dozen at most– a 23andme test assessing thousands of SNPs and hundreds of diseases/traits constitutes more diagnostic tests on an individual at one time than have previously been performed in a lifetime.

To understand how many tests are being performed in a 23andme experiment, it is helpful to look at the Interpretome website. The website allows a user to examine information on SNPs without paying, and without uploading the data. I took a look at Samantha, and the Interpretome gave information about 2829 SNPs. These are SNPs for which there is a research article that has identified the SNP as significant in some association study (the website conveniently provides direct links to the articles). For example, here are two rows from the phenotype table describing something about Samantha’s genetic predisposition for large head circumference:

Head circumference (infant) 11655470  CC  T  .05    4E-6    22504419
Head circumference (infant) 1042725    CC  T  .07    3E-10  22504419

Samantha’s genotype at the locus is CC, the “risk” allele is T, the odds ratios  are very small (0.05,0.07) and the p-values are apparently significant. Interpretome’s results differ from those of 23andme, but looking at the diversity of phenotypes reported on gives one a sense for the possibilities that currently exist in genetics, and the scope of 23andme’s reports.

From the estimates of error rates provided above, and using the back of an envelope, it stands to reason that about 1/3 of 23andme tested individuals have an error at one of their “interesting” SNPs. Not all of SNPs arising in association studies are related to diseases, but many of them are. I don’t think its unreasonable to postulate that a significant percentage of 23andme customers have some error in a SNP that is medically important. Whether such errors are typically false positives or false negatives is unclear, and the extent to which they may lead to significant odds ratios is another interesting question. In other words, its not good enough to know how frequently warfarin sensitivity is being called incorrectly. The question is how frequently some medically significant result is incorrect.

Of course, the issue of multiple testing as it pertains to interpreting genotypes is probably a secondary issue with 23andme. As many bloggers have pointed out, it is not even clear that many of 23andme’s odds ratios are accurate or meaningful. A major issue, for example, is the population background of an individual examining his/her genotype and how close it is to the population on which the GWAS were performed. Furthermore, there are serious questions about the meaning of the GWAS odds ratios in the case of complex traits. However I think the issue of multiple testing is a deeper one, and a problem that will only be exacerbated as more disease SNPs are identified. Having said that, there are also approaches that could mitigate errors and improve fidelity of the tests. As DECODE genetics has demonstrated, imputation and phasing can in principle be used to infer population haplotypes, which not only are useful for GWAS analyses, but can also be used to identify erroneous SNP calls. 23andme’s problem is that although they have many genotypes, they are from diverse populations that will be harder to impute and phase.

The issue of multiple testing arising in the context of 23andme and the contrast with classic diagnostics reminds me of the dichotomy between whole-genome analysis and classic single gene molecular biology. The way in which customers are looking at their 23andme results is precisely to look for the largest effects, i.e. phenotypes where they appear to have high odds of contracting a disease, or being sensitive to some drug. This is the equivalent of genome scientists picking the “low hanging fruit” out of genome-wide experiments such as those performed in ENCODE. In genomics, scientists have learned (with some exceptions)  how to interpret genome-wide analyses after correcting for multiple-hypothesis testing by controlling for false discovery rate. But are the customers of 23andme doing so? Is the company helping them do it? Should it? Will the FDA require it? Can looking at ones own genotype constitute too much testing?

There are certainly many precedents for superfluous harmful testing in medicine. For example, the American Academy of Family Physicians has concluded that prostate cancer PSA tests and digital rectal exams have marginal benefits that are outweighed by the harm caused by following up on positive results. Similar arguments have been made for mammography screening. I therefore think that there are serious issues to consider about the implications of direct-to-consumer genetic testing and although I support the democratization of genomics, I’m glad the FDA is paying attention.


Samantha’s type 2 diabetes risk as estimated from her genotype by Interpretome. She appears to have a lower risk than an average person. Does this make it ok for her to have another cookie?

I visited Duke’s mathematics department yesterday to give a talk in the mathematical biology seminar. After an interesting day meeting many mathematicians and (computational) biologists, I had an excellent dinner with Jonathan Mattingly, Sayan MukherjeeMichael Reed and David Schaeffer. During dinner conversation, the topic of probability theory (and how to teach it) came up, and in particular Buffon’s needle problem.

The question was posed by Georges-Louis Leclerc, Comte de Buffon in the 18th century:

Suppose we have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor. What is the probability that the needle will lie across a line between two strips?

If the strips are distance t apart, and l \leq t, then it is easy to see that the probability P is given by

P = \int_{\theta =0}^{\frac{\pi}{2}} \int_{x = 0}^{\frac{l}{2}sin \theta} \frac{4}{t \pi} dx d\theta = \frac{2l}{t \pi}.

The appearance of \pi in the denominator turns the problem into a Monte Carlo technique for estimating \pi: simply simulate random needle tosses and count crossings.

It turns out there is a much more elegant solution to the problem– one that does not require calculus. I learned of it from Gian-Carlo Rota when I was a graduate student at MIT. It appears in his book Introduction to Geometric Probability (with Dan Klain) that I have occasionally used when teaching Math 249. The argument relies on the linearity of expectation, and is as follows:

Let f(l) denote the expected number of crossings when a needle of length l is thrown on the floor. Now consider two needles, one of length l and the other m, attached to each other end to end (possibly at some angle). If X_1 is a random variable describing the number of crossings of the first needle, and X_2 of the second, its certainly the case that X_1 and X_2 are dependent, but because expectation is linear, it is the case that E(X_1+X_2) = E(X_1)+E(X_2). In other words, the total number of crossings is, in expectation, f(l)+f(m).


Buffon’s needle problem: what is the probability that a needle of length l \leq t crosses a line? (A) A short needle being thrown at random on a floor with parallel lines. (B) Two connected needles. The expected number of crossings is proportional to the sum of their lengths. (C) A circle of diameter always crosses exactly two lines.

It follows that f is a linear function, and since f(0)=0, we have that f(l) = cl where c is some constant. Now consider a circle of diameter t. Such a circle, when thrown on the floor, always crosses the parallel lines exactly twice. If C is a regular polygon with vertices on the circle, and the total length of the polygon segments is l, then the total number of crossings is f(l). Taking the limit as the number of segments in the polygon goes to infinity, we find that f(t \pi ) = 2. In other words,

f(t \pi) = c \cdot t \pi = 2 \Rightarrow c = \frac{2}{t \pi},

and the expected number of crossings of a needle of length l is \frac{2l}{t \pi}. If l < t, the number of crossings is either 0 or 1, so the expected number of crossings is, by definition of expectation, equal to the probability of a single crossing. This solves Buffon’s problem no calculus required!

The linearity of expectation appears elementary at first glance. The proof is simple, and it is one of the first “facts” learned in statistics– I taught it to my math 10 students last week. However the apparent simplicity masks its depth and utility; the above example is cute, and one of my favorites, but linearity of expectation is useful in many settings. For example I recently saw an interesting application in an arXiv preprint by Anand Bhaskar, Andy Clark and Yun Song on “Distortion of genealogical properties when the sample is very large“.

The paper addresses an important question, namely the suitability of the coalescent as an approximation to discrete time random mating models, when sample sizes are large. This is an important question, because population sequencing is starting to involve hundreds of thousands, if not millions of individuals.

The results of Bhaskar, Clark and Song are based on dynamic programming calculations of various genealogical quantities as inferred from the discrete time Wright-Fisher model. An example is the expected frequency spectrum for random samples of individuals from a population. By frequency spectrum, they mean, for each k, the expected number of polymorphic sites with k derived alleles and n-k ancestral alleles under an infinite-sites model of mutation in a sample of n individuals. Without going into details (see their equations (8),(9) and (10)), the point is that they are able to derive dynamic programming recursions because they are computing the expected frequencies, and the linearity of expectation is what allows for the derivation of the dynamic programming recursions.

None of this has anything to do with my seminar, except for the fact that the expectation-maximization algorithm did make a brief appearance, as it frequently does in my lectures these days. I spoke mainly about some of the mathematics problems that arise in comparative transcriptomics, with a view towards a principled approach to comparing transcriptomes between cells, tissues, individuals and species.


The Duke Chapel. While I was inside someone was playing the organ, and as I stared at the ceiling, I could have sworn I was in Europe.

The International Society for Clinical Densitometry has an official position on the number of decimal digits that should be reported when describing the results of bone density scans:

  • BMD (bone mineral
    density): three digits (e.g., 0.927 g/cm2). •
  • T-score: one digit (e.g., –2.3).
  •  Z-score: one digit (e.g., 1.7).
  •  BMC (bone mineral content): two digits (e.g., 31.76
  •  Area: two digits (e.g., 43.25
  •  % reference database: Integer
    (e.g., 82%).
Are these recommendations reasonable? Maybe not. For example they fly in the face of the recommendation in the “seminal” work of Ehrenberg (Journal of the Royal Statistical Society A, 1977)  which is to use two decimal digits.

Two? Three? What should it be? This what my Math10 students always ask of me.

I answered this question for my freshmen in Math10 two weeks ago by using an example based on a dataset from the paper Schwartz, M. “A biomathematical approach to clinical tumor growth“. Cancer 14 (1961): 1272-1294. The paper has a dataset consisting of the size of a pulmonary neoplasm over time:

TumorA simple model for the tumor growth is f(t) = a \cdot b^t and in class I showed how a surprisingly good fit can be obtained by interpolating through only two points (t=0 and t=208):

f(0)= 1.8 \Rightarrow a \cdot b^0 = 1.8 \Rightarrow a = 1.8.

Then we have that f(208) = 3.5 \Rightarrow 1.8 \cdot b^{208} = 3.5 \Rightarrow b= \sqrt{208}{3.5/1.8} \approx 1.00302.

The power function f(t)=1.8 \cdot t^{1.00302} is shown in the figure. The fit is surprisingly good considering it is based on an interpolation using only two points. The point of the example is that if one rounds the answer 1.00302 to two decimal digits then one obtains  f(t) = 1.8t^1 = 1.8t which is linear as opposed to super-linear. In other words, a small (quantitative) change in the assumptions (restricting the rate to intervals differing by 0.01) results in a major qualitative change in results:  with two decimal digits the patient lives, with three… death!

This simple example of decimal digit arithmetic illustrates a pitfall affecting many computational biology studies. It is tempting to believe that \mbox{Qualitative} \subset \mbox{Quantitative}, i.e. that focusing on qualitative analysis allows for the flexibility of ignoring quantitative assumptions.  However frequently the qualitative devil is in the quantitative details.

One field where qualitative results are prevalent, and therefore the devil strikes frequently, is network theory. The emphasis on searching for “universal phenomena”, i.e. qualitative results applicable to networks arising in different contexts, arguably originates with Milgram’s small world experiment that led to the concept of “six-degree of separation” and Watts and Strogatz’s theory of collective dynamics in small-world networks (my friend Peter Dodds replicated Milgram’s original experiment using email in “An experimental study of search in global social networks“, Science 301 (2003), p 827–829) . In mathematics these ideas have been popularized via the Erdös number which is the distance between an author and Paul Erdös in a graph where two individuals are connected by an edge if they have published together. My Erdös number is 2, a fact that is of interest only in that it divulges my combinatorics roots. I’m prouder of other connections to researchers that write excellent papers on topics of current interest. For example, I’m pleased to be distance 2 away from Carl Bergstrom via my former undergraduate student Frazer Meacham (currently one of Carl’s Ph.D. students) and the papers:

  1. Meacham, Frazer, Dario Boffelli, Joseph Dhahbi, David IK Martin, Meromit Singer, and Lior Pachter. “Identification and Correction of Systematic Error in High-throughput Sequence Data.” BMC Bioinformatics 12, no. 1 (November 21, 2011): 451. doi:10.1186/1471-2105-12-451.
  2. Meacham, Frazer, Aaron Perlmutter, and Carl T. Bergstrom. “Honest Signaling with Costly Gambles.” Journal of the Royal Society Interface 10, no. 87 (October 6, 2013):20130469. dpi:10.1098/rsif.2013.0469.
One of Bergstrom’s insightful papers where he exposes the devil (in the quantitative details) is “Nodal Dynamics, Not Degree Distributions, Determine the Structural Controllability of Complex Networks”  by Cowan et al., PLoS One 7 (2012), e88398. It describes a not-so-subtle example of an unreasonable quantitative assumption that leads to intuition about network structural controllability that is, to be blunt, false. The example Carl critiques is from the paper

Controllability of complex networks” by Yang-Yu Liu, Jean-Jacques Slotine and Albert-László Barabási, Nature 473 (2011), p 167–173. The mathematics is straightforward: It concerns the dynamics of linear systems of the form

\frac{d{\bf x}(t)}{dt} =-p{\bf x}(t) + A{\bf x}(t) + B{\bf u}(t) .

The dynamics can be viewed as taking place on a graph whose adjacency matrix is given by the non-zero entries of A (an nxn matrix). The vector -p (of size nis called the pole of the linear system and describes intrinsic dynamics at the nodes. The vector (of size mcorresponds to external inputs that are coupled to the system via the nxm matrix B.

An immediate observation is that the vector p is unnecessary and can be incorporated into the diagonal of the matrix A. An element on the diagonal of A that is then non-zero can be considered to be a self-loop. The system then becomes

\frac{d{\bf x}(t)}{dt} =A{\bf x}(t) + B{\bf u}(t)

which is the form considered in the Liu et al. paper (their equation(1)). The system is controllable if there are time-dependent u that can drive the system from any initial state to a  target end state. Mathematically, this is equivalent to asking whether the matrix C=(B,AB,A^2B,\ldots, A^{n-1}B) has full rank (a classic result known as Kalman’s criteria of controllability). Structural controllability is a weaker requirement, in which the question is  whether given only adjacency matrices and B, there exists weights for edges so that the weighted adjacency matrices satisfy Kalman’s criteria. The point of structural controllability is a theorem showing that structurally controllable systems are generically controllable.

The Liu et al. paper makes two points: the first is that if M is the size of a maximum matching in a given nxn adjacency matrix A, then the minimum m for which there exists a matrix B of size nxm for which the system is structurally controllable is m=n-M+1 (turns out this first point had already been made, namely in a paper by Commault et al. from 2002). The second point is that m is related to the degree distribution of the graph A. 

The point of the Cowan et al. paper is to explain that the results of Liu et al. are completely uninteresting if a_{ii}  is non-zero for every i. This is because M is then equal to n (the matching matching of A consists of every self-loop). And therefore the result of Liu et al. reduces to the statement that m=1, or equivalently, that structural controllability for real-world networks can also be achieved with a single control input.

Unfortunately for Liu and coauthors, barring pathological canceling out of intrinsic dynamics with self-regulation, the diagonal elements of A, a_{ii} will be zero only if there are no intrinsic dynamics to the system (equivalently p_i=0 or the time constants \frac{1}{p_i} are infinite).  I repeat the obvious by quoting from Cowan et al.:

“However, infinite time constants at each node do not generally reflect the dynamics of the physical and biological systems in Table 1 [of Liu et al.]. Reproduction and mortality schedules imply species- specific time constants in trophic networks. Molecular products spontaneously degrade at different rates in protein interaction networks and gene regulatory networks. Absent synaptic input, neuronal activity returns to baseline at cell-specific rates. Indeed, most if not all systems in physics, biology, chemistry, ecology, and engineering will have a linearization with a finite time constant. ”

Cowan et al. go a bit further than simply refuting the premise and results of Liu et al. They avoid the naïve reduction of a system with intrinsic dynamics to one with self-loops, and provide a specific criterion for the number of nodes in the graph that must be controlled.

In summary, just as with the rounding of decimal digits, a (simple looking) assumption of Liu et al., namely that p=0 completely changes the qualitative nature of the result. Moreover, it renders false the thesis of the Liu et al. paper, namely that degree distributions in (real) networks affect the requirements for controllability.


I recently completed a term as director of our Center for Computational Biology and one of the things I am proud of having accomplished is helping Brian McClendon, program administrator for the Center, launch a new Ph.D. program in computational biology at UC Berkeley.

The Ph.D. program includes a requirement for taking a new course, “Classics in Computational Biology” (CMPBIO 201), that introduces students to research projects and approaches in computational biology via a survey of classic papers. I taught the class last week and the  classic paper I chose was  “Comparison of biosequences” by Temple Smith and Michael Waterman, Advances in Applied Mathematics 2 (1981), p 482–489.  I gave a preparatory lecture this past week in which I discussed the Needleman-Wunsch algorithm, followed by leading a class discussion about the Smith-Waterman paper and related topics. This post is an approximate transcription of my lecture on the Needleman-Wunsch algorithm:

The Needleman-Wunsch algorithm was published in “A general method applicable to the search for similarities in the amino acid sequence of two proteins“, Needleman and Wunsch, Journal of Molecular Biology 48 (1970), p 443–453. As with neighbor-joining or UniFrac that I just discussed in my previous blog post, the Needleman-Wunsch algorithm was published without an explanation of its meaning, or even a precise description of the dynamic programming procedure it is based on. In their paper 11 years later, Smith and Waterman write

“In 1970 Needleman and Wunsch [1] introduced their homology (similarity) algorithm. From a mathematical viewpoint, their work lacks rigor and clarity. But their algorithm has become widely used by the biological community for sequence comparisons.”

They go on to state that “The statement in Needleman-Wunsch [1] is less general, unclearly stated, and does not have a proof” (in comparison to their own main Theorem 1). The Theorem Smith and Waterman are referring to explicitly describes  the dynamic programming recursion only implicit in Needleman-Wunsch, and then explains why it produces the “optimal” alignment. In hindsight, the Needleman-Wunsch algorithm can be described formally as follows:

Given two sequences and b of lengths n_1 and n_2 respectively, an alignment A(a,b) is a sequence of pairs:

[(a_{i_1},b_{j_1}),(a_{i_2},b_{j_2}),\ldots : 1 \leq i_1 < i_2 < \cdots \leq n_1, 1 \leq j_1 < j_2 \cdots \leq n_2 ].

Given an alignment A(a,b), we let

M = | \{ k: a_{i_k} = b_{j_k} \} |, X = | \{ k: a_{i_k} \neq b_{j_k} \} |


S = | \{ r:a_r \neq i_k \, \mbox{for any k} \} | + | \{ r:b_r \neq j_k \, \mbox{for any k} \} |.

In other words, M counts the number of matching characters in the alignment, X counts the number of mismatches and S the number of “spaces”, i.e. characters not aligned to another character in the other sequence. Since every character must be either matched, mismatched, or a not aligned, we have that

2 M + 2 X + S = n_1+n_2.

Given scores (parameters) consisting of real numbers m,x,s, the Needleman-Wunsch algorithm finds an alignment that maximizes the function m \cdot M + x \cdot X + s \cdot S. Note that by the linear relationship between M,X and described above, there are really only two free parameters, and without loss of generality we can assume they are and x. Furthermore, only and are relevant for computing the score of the alignment; when the scores are provided statistical meaning, one can say that and X are sufficient statistics. We call them the summary of the alignment.

The specific procedure of the Needleman-Wunsch algorithm is to compute a matrix S recursively:

S(i,j) = \mbox{max} \{ S_{i-1,j} + s,S_{i,j-1} + s, S_{i-1,j-1} + xI(a_{i} = b_{j} )\}

where I(a_i = b_j) is the indicator function that is 1 if the characters are the same, and zero otherwise, and an initialization step consists of setting S(i,0)=s \cdot i and S(0,j) = s \cdot j for all i,j. There are numerous generalizations and improvements to the basic algorithm, both in extending the scoring function to include more parameters (although most generalizations retain the linearity assumption), and algorithms for trading off time for space improvements (e.g. divide-and-conquer can be used to improve the space requirements from O(n_1 n_2) to O(n_1 + n_2)).

An important advance in understanding the Needleman-Wunsch algorithm was the realization that the “scores” m,x and s can be interpreted as logarithms of probabilities if the sign on the parameters is the same, or as log-odds ratios if the signs are mixed. This is discussed in detail in the book “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Sequences” by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison.

One insight from the statistical point of view is that if  max and plus in the recursion are replaced with plus and times, and the logarithms of probabilities are replaced by the probabilities themselves, then the Needleman-Wunsch recursion computes the total probability of the pairs of sequences marginalized over alignments (i.e., it is the forward algorithm for a suitably defined hidden Markov model). Together with the backward algorithm, which consists of running the forward algorithm on the reversed sequences, this provides an efficient way to compute for every pair of nucleotides the posterior probability that they are aligned. In other words, there is a meaningful interpretation and useful application for the Needleman-Wunsch algorithm with the semi-ring (\mathbb{R}_{-}, \oplus := \mbox{max}, \otimes := +) replaced with (\mathbb{R}, \oplus := +, \otimes := \times).

In work in collaboration with Bernd Sturmfels, we realized that it also makes sense to run the Needleman-Wunsch algorithm with the  polytope algebra (\mathcal{P}, \oplus := \mbox{convex hull of union}, \otimes := \mbox{Minkowski sum}). Here \mathcal{P} is  the set of convex polytopes in \mathbb{R}^d for some d, polytopes are “added” using the geometric operation of the convex hull of the union of polytopes, and they are “multiplied” by taking Minkowski sum. The result allows one to find not only a single optimal alignment, but all optimal alignments.  The details are discussed in a pair of papers:

L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16138–16143.

C. Dewey, P. Huggins, K, Woods, B. Sturmfels and L. Pachter, Parametric alignment of Drosophila genomes, PLoS Computational Biology, Volume 2, Number 6 (2006) p e73.

The first explains the mathematical foundations for using the result and is where we coined the term polytope propagation algorithm for the Needleman-Wunsch algorithm with the polytope algebra. The second paper illustrates an application to a whole-genome alignment of two Drosophilids and provides an alternative to polytope propagation (the beneath-beyond algorithm) that is much faster. The type of analysis performed in the paper is also called parametric alignment, a problem with a long history whose popularity was launched by Dan Gusfield who provided the first software program for parametric alignment called XPARAL.

The polytope propagation algorithm (Needleman-Wunsch with the polytope algebra) is illustrated in the figure below:


The convex polygon in the bottom right panel contains in its interior points corresponding to all possible summaries for the alignments of the two sequences. The optimal summaries are vertices of the polygon. For each vertex, the parameters for which the Needleman-Wunsch algorithm will produce a summary corresponding to that vertex form a cone. The cones together form a fan which is shown inside the polygon.

The running time of polytope propagation depends on the number of vertices in the final polytope. In our papers we provide bounds showing that polytope propagation is extremely efficient. In particular, for parameters, the number of vertices is at most O(n^{\frac{d(d-1)}{(d+1)}})Cynthia Vinzant, formerly in our math department at UC Berkeley, has written a nice paper on lower bounds.

The fact that the Needleman-Wunsch algorithm can be run with three semi-rings means that it is particularly well suited to C++ thanks to templates that allow for the variables in the recursion to be abstracted. This idea (and code) is thanks to my former student Colin Dewey:

template<typename SemiRing>

alignGlobalLastRow(const string& seq1,
                   const string& seq2,
                   const typename SemiRing::Element match,
                   const typename SemiRing::Element mismatch,
                   const typename SemiRing::Element gap,
                   vector& row);
const Element one = SemiRing::multiplicativeIdentity;
// Initialize row
row.resize(seq2.size() + 1);
row[0] = one;
// Calculate first row
for (size_t j = 1; j <= seq2.size(); ++j) 
    row[j] = gap * row[j - 1];
// Calculate remaining rows
Element up, diag;
for (size_t i = 1; i <= seq1.size(); ++i) {
    diag = row[0];
    row[0] *= gap;
    for (size_t j = 1; j <= seq2.size(); ++j) {
        up = row[j];
        if (seq1[i - 1] == seq2[j - 1]) {
            row[j] = match * diag + gap * (up + row[j - 1]);
        } else {
            row[j] = mismatch * diag + gap * (up + row[j - 1]);
        diag = up;

To recap, the three semi-rings with which it makes sense to run this algorithm are:

  1. (\mathbb{R}, \oplus := +, \otimes := \times)
  2. (\mathbb{R}_{-}, \oplus := \mbox{max}, \otimes := +)
  3. (\mathcal{P}, \oplus := \mbox{convex hull of union}, \otimes := \mbox{Minkowski sum})

The interpretation of the output of the algorithm with the three semi-rings is:

  1. S_{ij} is the score (probability) of a maximum alignment between the ith prefix of one sequence and the jth prefix of the other.
  2. S_{ij} is the total score (probability) of all alignments between the ith prefix of one sequence and the jth prefix of the other.
  3. S_{ij} is a polytope whose faces correspond to summaries that are optimal for some set of parameters.

The semi-rings in (2) and (3) lead to particularly interesting and useful applications of the Needleman-Wunsch algorithm. An example of the use of (2), is in posterior decoding based alignment where nucleotides are aligned as to maximize the sum of the posterior probabilities computed by (2). In the paper “Alignment Metric Accuracy” (with Ariel Schwartz and Gene Myers) we  provide an interpretation in terms of an interesting metric on sequence alignments and in

A.S. Schwartz and L. Pachter, Multiple alignment by sequence annealing, Bioinformatics 23 (2007), e24–e29

we show how to adapt posterior decoding into a (greedy) multiple alignment algorithm. The video below shows how it works, with each step consisting of aligning a single pair of nucleotides from some pair of sequences using posterior probabilities computed using the Needleman-Wunsch algorithm with semi-ring (2):

We later extended this idea in work together with my former student Robert Bradley in what became FSA published in the paper “Fast Statistical Alignment“, R. Bradley et al., PLoS Computational Biology 5 (2009), e1000392.

An interesting example of the application of Needleman-Wunsch with semi-ring (3), i.e. polytope propagation, is provided in the recent paper “The RNA Newton polytope and learnability of energy parameters” by Elmirasadat Forouzmand and Hamidreza Chitsaz, Bioinformatics 29 (2013), i300–i307. Using our polytope propagation algorithm, they investigate whether simple models for RNA folding can produce, for some set of parameters, the folds of sequences with solved structures (the answer is sometimes, but not always).

It is remarkable that forty three years after the publication of the Needleman-Wunsch paper, and thirty two years after the publication of the Smith-Waterman paper, the algorithms remain pertinent, widely used, and continue to reveal new secrets. True classics.

Where are the authors now?

Saul B. Needleman writes about numismatics.

Christian Wunsch is a clinical pathologist in the University of Miami health system.

Temple Smith is Professor Emeritus of Biomedical Engineering at Boston University.

Michael Waterman is Professor of Biological Sciences, Mathematics and Computer Science at the University of Southern California.

Blog Stats

%d bloggers like this: